[R] Splitting a DataFrame with Multiple Variables into Train and Test Sets by Stratified Random Sampling, and Standardizing It

R Analysis and Programming / R Data Preprocessing 2020. 1. 19. 13:08

In this post I will cover the following basic tasks needed when building prediction or classification models in R:

(1) Splitting a DataFrame into a Train set and a Test set

- (1-1) Split of Train, Test set by Random Sampling

- (1-2) Split of Train, Test set by Sequential Sampling

- (1-3) Split of Train, Test set by Stratified Random Sampling

(2) Standardizing a DataFrame with multiple numeric variables (Standardization of Numeric Data)

- (2-1) z-transformation (standardization)

- (2-2) [0-1] transformation (normalization)

(3) Creating dummy variables from a DataFrame with multiple categorical variables (Getting Dummy Variables)

As the example data, I will load the Cars93 DataFrame from the MASS package. Because it has many variables, to keep the example simple I subset only 'Price', 'Horsepower', 'RPM', 'Length', 'Type', and 'Origin' as the explanatory variables X, and use 'MPG.highway' as the response variable y.

# get Cars93 DataFrame from MASS package

library(MASS)

data(Cars93)

str(Cars93)

'data.frame': 93 obs. of 27 variables:
 $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4...
 $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1...
 $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3...
 $ Min.Price         : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
 $ Price             : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
 $ Max.Price         : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3...
 $ MPG.city          : int 25 18 20 19 22 22 19 16 19 16 ...
 $ MPG.highway       : int 31 25 26 26 30 31 28 25 27 25 ...
 $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2...
 $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3...
 $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4...
 $ EngineSize        : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
 $ Horsepower        : int 140 200 172 172 208 110 170 180 170 200 ...
 $ RPM               : int 6300 5500 5500 5500 5700 5200 4800 4000 4800...
 $ Rev.per.mile      : int 2890 2335 2280 2535 2545 2565 1570 1320 1690...
 $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1...
 $ Fuel.tank.capacity: num 13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
 $ Passengers        : int 5 5 5 6 4 6 6 6 5 6 ...
 $ Length            : int 177 195 180 193 186 189 200 216 198 206 ...
 $ Wheelbase         : int 102 115 102 106 109 105 111 116 108 114 ...
 $ Width             : int 68 71 67 70 69 69 74 78 73 73 ...
 $ Turn.circle       : int 37 38 37 37 39 41 42 45 41 43 ...
 $ Rear.seat.room    : num 26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
 $ Luggage.room      : int 11 15 14 17 13 16 17 21 14 18 ...
 $ Weight            : int 2705 3560 3375 3405 3640 2880 3470 4105 3495...
 $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1...
 $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...

X <- subset(Cars93, select=c('Price', 'Horsepower', 'RPM', 'Length', 'Type', 'Origin'))

head(X)

A data.frame: 6 × 6
Price	Horsepower	RPM	Length	Type	Origin
<dbl>	<int>	<int>	<int>	<fct>	<fct>
15.9	140	6300	177	Small	non-USA
33.9	200	5500	195	Midsize	non-USA
29.1	172	5500	180	Compact	non-USA
37.7	172	5500	193	Midsize	non-USA
30.0	208	5700	186	Midsize	non-USA
15.7	110	5200	189	Midsize	USA

table(X$Origin)

[Out]:
    USA non-USA 
     48      45

y <- Cars93$MPG.highway

(1) Splitting a DataFrame into a Train set and a Test set

(1-1) Split of Train, Test set by Random Sampling

For a simple one-off random split, generate random row indices with the sample() function and use them for indexing.

(* Reference: https://rfriend.tistory.com/58)

# (1) index for splitting data into Train and Test set

set.seed(1004) # for reproducibility

train_idx <- sample(1:nrow(X), size=0.8*nrow(X), replace=F) # train-set 0.8, test-set 0.2

test_idx <- (-train_idx)

X_train <- X[train_idx,]

y_train <- y[train_idx]

X_test <- X[test_idx,]

y_test <- y[test_idx]

print(paste0('X_train: ', nrow(X_train)))

print(paste0('y_train: ', length(y_train)))

print(paste0('X_test: ', nrow(X_test)))

print(paste0('y_test: ', length(y_test)))

[Out]:
[1] "X_train: 74"
[1] "y_train: 74"
[1] "X_test: 19"
[1] "y_test: 19"
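The split above passes a non-integer size (0.8 * 93 = 74.4) to sample(), and the printed counts show that 74 rows ended up in the train set. The sketch below makes that rounding explicit with floor() and checks that the two index sets do not overlap. It is only an illustration: n = 93 mirrors the Cars93 row count, and deriving the test indices with setdiff() is my own choice rather than something taken from the code above.

```r
# Minimal sketch: explicit rounding and a disjointness check for a random split.
n <- 93                      # assumed row count, same as Cars93

set.seed(1004)               # for reproducibility
train_idx <- sample(seq_len(n), size = floor(0.8 * n), replace = FALSE)
test_idx  <- setdiff(seq_len(n), train_idx)   # every row not drawn for training

length(train_idx)                       # 74
length(test_idx)                        # 19
length(intersect(train_idx, test_idx))  # 0, the two sets share no rows
```

Positive test indices from setdiff() behave the same as the negative indexing used above when subsetting X and y, but they are easier to inspect or reuse later.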

(1-2) Split of Train, Test set by Sequential Sampling

In time series analysis the timestamp order must be preserved, so the random sampling of (1-1) should not be used. Instead, keep the rows in time order and use the earlier time span as the training set and the later (future) time span as the test set.

# sequential sampling

test_size <- 0.2

test_num <- ceiling(nrow(X) * test_size)

train_num <- nrow(X) - test_num

X_train <- X[1:train_num,]

X_test <- X[(train_num+1):nrow(X),]

y_train <- y[1:train_num]

y_test <- y[(train_num+1):length(y)]
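To make the time-order requirement concrete, here is a small sketch on a made-up daily series (the ts_df data frame, its column names, and its values are hypothetical). It uses head() and tail() as an equivalent of the index ranges above and confirms that every training date precedes every test date.

```r
# Minimal sketch: sequential split of a hypothetical daily series.
ts_df <- data.frame(
  day   = as.Date("2020-01-01") + 0:9,          # already sorted by time
  value = c(5, 7, 6, 8, 9, 11, 10, 12, 13, 15)
)

test_size <- 0.2
test_num  <- ceiling(nrow(ts_df) * test_size)
train_num <- nrow(ts_df) - test_num

train <- head(ts_df, train_num)   # earliest rows
test  <- tail(ts_df, test_num)    # latest rows

max(train$day) < min(test$day)    # TRUE when the time order is preserved
```

If the data are not already sorted by time, they would need to be ordered first (for example with order() on the timestamp column) before splitting this way.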

(1-3) Split of Train, Test set by Stratified Random Sampling

The random sampling and sequential sampling splits introduced in (1-1) and (1-2) are wrapped in a user-defined function named random_split(), and a second user-defined function, train_test_split(), adds stratified random sampling on top of it. (Its arguments and return values are modeled on the train_test_split() function of Python's sklearn.) (* Reference: https://rfriend.tistory.com/58)

# --- user-defined function for a train/test split with random or sequential sampling
random_split <- function(X, y
                         , test_size
                         , shuffle
                         , random_state) {

  test_num <- ceiling(nrow(X) * test_size)
  train_num <- nrow(X) - test_num

  if (shuffle == TRUE) {
    # shuffle == TRUE: random sampling
    set.seed(random_state) # for reproducibility
    test_idx <- sample(1:nrow(X), size=test_num, replace=F)
    train_idx <- (-test_idx)

    X_train <- X[train_idx,]
    X_test <- X[test_idx,]
    y_train <- y[train_idx]
    y_test <- y[test_idx]
  } else {
    # shuffle == FALSE: sequential split that keeps the row order
    X_train <- X[1:train_num,]
    X_test <- X[(train_num+1):nrow(X),]
    y_train <- y[1:train_num]
    y_test <- y[(train_num+1):length(y)]
  }

  return (list(X_train, X_test, y_train, y_test))
}

# --- user-defined function train_test_split() with stratified random sampling
train_test_split <- function(X, y
                             , test_size=0.2
                             , shuffle=TRUE
                             , random_state=2004
                             , stratify=FALSE, strat_col=NULL){

  if (stratify == FALSE){ # simple random (or sequential) sampling
    split <- random_split(X, y, test_size, shuffle, random_state)

    # [[ ]] extracts the elements themselves, not one-element lists
    X_train <- split[[1]]
    X_test <- split[[2]]
    y_train <- split[[3]]
    y_test <- split[[4]]
  } else { # --- stratified random sampling
    strata <- unique(as.character(X[,strat_col]))

    X_train <- data.frame()
    X_test <- data.frame()
    y_train <- vector()
    y_test <- vector()

    for (stratum in strata){
      # rows that belong to the current stratum
      in_stratum <- X[,strat_col] == stratum
      X_stratum <- X[in_stratum, ]
      y_stratum <- y[in_stratum]

      # split within the stratum, then append to the overall sets
      split_stratum <- random_split(X_stratum, y_stratum, test_size, shuffle, random_state)

      X_train <- rbind(X_train, split_stratum[[1]])
      X_test <- rbind(X_test, split_stratum[[2]])
      y_train <- c(y_train, split_stratum[[3]])
      y_test <- c(y_test, split_stratum[[4]])
    }
  }

  return (list(X_train, X_test, y_train, y_test))
}

Using the train_test_split() user-defined function defined above, let's split the data into train and test sets by stratified random sampling on the 'Origin' variable (a factor with two levels, 'USA' and 'non-USA').

split_list <- train_test_split(X, y

, test_size=0.2

, shuffle=TRUE

, random_state=2004

, stratify=TRUE, strat_col='Origin')

X_train <- data.frame(split_list[1])

X_test <- data.frame(split_list[2])

y_train <- unlist(split_list[3])

y_test <- unlist(split_list[4])

print(paste0('Dim of X_train: ', nrow(X_train), ', ', ncol(X_train)))

print(paste0('Dim of X_test: ', nrow(X_test), ', ', ncol(X_test)))

print(paste0('Length of y_train: ', length(y_train)))

print(paste0('Length of y_test: ', length(y_test)))

[Out]:

[1] "Dim of X_train: 74, 6"
[1] "Dim of X_test:  19, 6"
[1] "Length of y_train: 74"
[1] "Length of y_test:  19"
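The calls above wrap split_list[1] in data.frame() and split_list[3] in unlist() because single-bracket indexing of a list returns a one-element list rather than the element itself. The sketch below illustrates the difference on a small made-up list (res and its contents are hypothetical); double brackets return the stored object directly.

```r
# Minimal sketch: `[` vs `[[` when unpacking a list of results.
res <- list(data.frame(x = 1:3), c(10, 20, 30))   # hypothetical return value

class(res[1])      # "list": still a list holding one data frame
class(res[[1]])    # "data.frame": the data frame itself

is.list(res[2])    # TRUE
res[[2]]           # the numeric vector 10 20 30
```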

X_test

A data.frame: 19 × 6
	Price	Horsepower	RPM	Length	Type	Origin
	<dbl>	<int>	<int>	<int>	<fct>	<fct>
44	8.0	81	5500	168	Small	non-USA
2	33.9	200	5500	195	Midsize	non-USA
39	8.4	55	5700	151	Small	non-USA
40	12.5	90	5400	164	Sporty	non-USA
3	29.1	172	5500	180	Compact	non-USA
53	8.3	82	5000	164	Small	non-USA
45	10.0	124	6000	172	Small	non-USA
90	20.0	134	5800	180	Compact	non-USA
42	12.1	102	5900	173	Small	non-USA
16	16.3	170	4800	178	Van	USA
7	20.8	170	4800	200	Large	USA
11	40.1	295	6000	204	Midsize	USA
73	9.0	74	5600	177	Small	USA
12	13.4	110	5200	182	Compact	USA
8	23.7	180	4000	216	Large	USA
23	9.2	92	6000	174	Small	USA
17	16.6	165	4000	194	Van	USA
74	11.1	110	5200	181	Compact	USA
14	15.1	160	4600	193	Sporty	USA

table(X$Origin)

[Out]:
    USA non-USA 
     48      45

table(X_test$Origin)

[Out]:
    USA non-USA 
     10       9
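The counts above (48/45 overall, 10/9 in the test set) suggest that the class proportions were preserved. One way to check this more directly is to compare prop.table() output for the full data and the test set. The sketch below does that on a toy factor whose level counts mirror Cars93$Origin; the sampling code is a compact stand-in written for this illustration, not the train_test_split() function itself.

```r
# Minimal sketch: compare class proportions before and after a stratified draw.
origin <- factor(c(rep("USA", 48), rep("non-USA", 45)),
                 levels = c("USA", "non-USA"))    # toy data mirroring the counts above
test_size <- 0.2

set.seed(2004)
# draw ceiling(20%) of the row indices within each stratum
test_idx <- unlist(lapply(split(seq_along(origin), origin), function(idx) {
  idx[sample.int(length(idx), size = ceiling(length(idx) * test_size))]
}))

table(origin[test_idx])                         # USA 10, non-USA 9
round(prop.table(table(origin)), 3)             # proportions in the full data
round(prop.table(table(origin[test_idx])), 3)   # proportions in the test set
```

With only 93 rows the two sets of proportions cannot match exactly, but they should be close; a large gap would point to a problem in the stratification step.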

y_test

[Out]: 33 25 50 36 26 37 29 30 46 23 28 25 41 36 25 33 20 31 28

For reference, the random sampling split of (1-1) can be done with the train_test_split() user-defined function defined in (1-3) as follows. (shuffle=TRUE)

# split of train, test set by random sampling using train_test_split() function

split_list <- train_test_split(X, y

, test_size=0.2

, shuffle=TRUE

, random_state=2004

, stratify=FALSE)

X_train <- data.frame(split_list[1])

X_test <- data.frame(split_list[2])

y_train <- unlist(split_list[3])

y_test <- unlist(split_list[4])

For reference, the sequential sampling split of (1-2) can be done with the train_test_split() user-defined function defined in (1-3) as follows. (shuffle=FALSE)

# split of train, test set by sequential sampling using train_test_split() function

split_list <- train_test_split(X, y

, test_size=0.2

, shuffle=FALSE

, random_state=2004

, stratify=FALSE)

X_train <- data.frame(split_list[1])

X_test <- data.frame(split_list[2])

y_train <- unlist(split_list[3])

y_test <- unlist(split_list[4])

(2) Standardizing a DataFrame with multiple numeric variables (Standardization of Numeric Data)

(2-1) z-transformation (standardization)

After separating the numeric variables from the categorical variables in X_train and X_test, let's apply the z-standardization transformation to the DataFrame of numeric variables. (* Reference: https://rfriend.tistory.com/52)

Because the DataFrame has several variables, X_mean <- apply(X_train_num, 2, mean) computes the mean of each numeric variable in the Train set, and X_stddev <- apply(X_train_num, 2, sd) computes the standard deviation of each.

Then scale(X_train_num, center=X_mean, scale=X_stddev) z-standardizes each numeric variable of the Train set, and scale(X_test_num, center=X_mean, scale=X_stddev) z-standardizes each numeric variable of the Test set.

One point needs care: the mean and standard deviation used for z-standardization are computed from the Train set only and then applied to both the Train set and the Test set. The Test set stands in for future, unseen data, so the only mean and standard deviation we can actually know are those of the Train set. (Many analysts standardize the whole dataset with scale() first and only then split it into Train and Test sets; strictly speaking, that is the wrong order.)

# split numeric, categorical variables

X_train_num <- X_train[, c('Price', 'Horsepower', 'RPM', 'Length')]

X_train_cat <- X_train[, c('Type', 'Origin')]

X_test_num <- X_test[ , c('Price', 'Horsepower', 'RPM', 'Length')]

X_test_cat <- X_test[ , c('Type', 'Origin')]

# (1) Z Standardization

# (1-1) using scale() function

X_mean <- apply(X_train_num, 2, mean)

X_stddev <- apply(X_train_num, 2, sd)

print('---- Mean ----')

print(X_mean)

print('---- Standard Deviation ----')

print(X_stddev)

[Out]:

[1] "---- Mean ----"
     Price Horsepower        RPM     Length 
  20.22703  146.08108 5278.37838  183.67568 
[1] "---- Standard Deviation ----"
     Price Horsepower        RPM     Length 
  9.697073  51.171149 594.730345  14.356620

X_train_num_scaled <- scale(X_train_num, center=X_mean, scale = X_stddev)

head(X_train_num_scaled)

A matrix: 6 × 4 of type dbl
	Price	Horsepower	RPM	Length
1	-0.44621989	-0.1188381	1.7177896	-0.46498935
4	1.80188107	0.5065143	0.3726422	0.64947906
5	1.00782706	1.2100357	0.7089291	0.16189913
41	-0.04403669	0.2720072	0.8770725	-0.60429791
43	-0.28122166	-0.1188381	0.5407856	0.09224485
46	-1.05465089	-1.0568667	0.4567139	-1.23118639

# note that 'mean' and 'stddev' are calculated using X_train_num dataset (NOT using X_test_num)

X_test_num_scaled <- scale(X_test_num, center=X_mean, scale = X_stddev)

head(X_test_num_scaled)

A matrix: 6 × 4 of type dbl
	Price	Horsepower	RPM	Length
44	-1.2608987	-1.2718315	0.3726422	-1.0918778
2	1.4100103	1.0536976	0.3726422	0.7887876
39	-1.2196491	-1.7799303	0.7089291	-2.2760005
40	-0.7968411	-1.0959512	0.2044988	-1.3704949
3	0.9150156	0.5065143	0.3726422	-0.2560265
53	-1.2299615	-1.2522893	-0.4680750	-1.3704949

# combine X_train_num_scaled, X_train_cat

X_train_scaled <- cbind(X_train_num_scaled, X_train_cat)

# combine X_test_num_scaled, X_test_cat

X_test_scaled <- cbind(X_test_num_scaled, X_test_cat)
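A quick sanity check of the train-only rule described in this section: after scaling with the Train-set statistics, the Train columns have mean 0 and standard deviation 1 (up to floating-point error), while the Test columns are only approximately centered. The sketch below shows this on simulated data; the train and test data frames and their columns a and b are hypothetical.

```r
# Minimal sketch: scale train and test with statistics taken from train only.
set.seed(1)
train <- data.frame(a = rnorm(80, mean = 10, sd = 2), b = runif(80, 0, 100))
test  <- data.frame(a = rnorm(20, mean = 10, sd = 2), b = runif(20, 0, 100))

mu   <- colMeans(train)      # per-column means from the train set
sdev <- sapply(train, sd)    # per-column standard deviations from the train set

train_z <- scale(train, center = mu, scale = sdev)
test_z  <- scale(test,  center = mu, scale = sdev)   # reuses the train statistics

round(colMeans(train_z), 10)   # 0 for both columns
apply(train_z, 2, sd)          # 1 for both columns
colMeans(test_z)               # near 0, but generally not exactly 0
```

The small offsets in the test set are expected: they reflect how the unseen data differ from the training data, which is exactly the information that would be hidden if the whole dataset were scaled before splitting.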

(2-2) [0-1] transformation (normalization)

Let's compute the minimum (min) and maximum (max) of each numeric variable and transform the values into the [0, 1] range.

(* Reference: https://rfriend.tistory.com/52)

# (2) [0-1] Normalization

# 0-1 transformation

X_max <- apply(X_train_num, 2, max)

X_min <- apply(X_train_num, 2, min)

X_train_num_scaled <- scale(X_train_num, center = X_min, scale = (X_max - X_min))

X_test_num_scaled <- scale(X_test_num, center = X_min, scale = (X_max - X_min))

head(X_train_num_scaled)

A matrix: 6 × 4 of type dbl
	Price	Horsepower	RPM	Length
1	0.15596330	0.3248945	0.9259259	0.4615385
4	0.55596330	0.4599156	0.6296296	0.6666667
5	0.41467890	0.6118143	0.7037037	0.5769231
41	0.22752294	0.4092827	0.7407407	0.4358974
43	0.18532110	0.3248945	0.6666667	0.5641026
46	0.04770642	0.1223629	0.6481481	0.3205128

head(X_test_num_scaled)

A matrix: 6 × 4 of type dbl
	Price	Horsepower	RPM	Length
44	0.01100917	0.07594937	0.6296296	0.3461538
2	0.48623853	0.57805907	0.6296296	0.6923077
39	0.01834862	-0.03375527	0.7037037	0.1282051
40	0.09357798	0.11392405	0.5925926	0.2948718
3	0.39816514	0.45991561	0.6296296	0.5000000
53	0.01651376	0.08016878	0.4444444	0.2948718

# combine X_train_num_scaled, X_train_cat

X_train_scaled <- cbind(X_train_num_scaled, X_train_cat)

# combine X_test_num_scaled, X_test_cat

X_test_scaled <- cbind(X_test_num_scaled, X_test_cat)
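In the test output above, one Horsepower value comes out negative (about -0.034). That is expected: the min and max come from the Train set only, so a Test value below the Train minimum or above the Train maximum falls outside [0, 1]. The sketch below reproduces the effect on hypothetical values and shows optional clipping with pmin() and pmax(); clipping is my own addition and not part of the workflow above, and whether it is appropriate depends on the model.

```r
# Minimal sketch: test values can leave [0, 1] when min/max come from train only.
train <- data.frame(Horsepower = c(63, 100, 150, 300))   # hypothetical values
test  <- data.frame(Horsepower = c(55, 200, 320))        # hypothetical values

x_min <- sapply(train, min)
x_max <- sapply(train, max)

test_01 <- scale(test, center = x_min, scale = x_max - x_min)
test_01[, "Horsepower"]   # first value is below 0, last value is above 1

# optional: clip to [0, 1] if the downstream model requires bounded inputs
test_clipped <- pmin(pmax(test_01, 0), 1)
range(test_clipped)       # 0 1
```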

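(3) Creating dummy variables from a DataFrame with multiple categorical variables (Getting Dummy Variables)

(3-1) Creating dummy variables from the categorical variables in a DataFrame with the dummyVars() function of the caret package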

library(caret)

# fit dummyVars()

dummy <- dummyVars(~ ., data = X_train_cat, fullRank = TRUE)

# predict (transform) dummy variables

X_train_cat_dummy <- predict(dummy, X_train_cat)

X_test_cat_dummy <- predict(dummy, X_test_cat)

head(X_train_cat_dummy)

A matrix: 6 × 6 of type dbl
Type.Midsize	Type.Small	Origin.non-USA
0	1	1
1	0	1
0	0	1
1	0	1
1	0	1
1	0	0

head(X_test_cat_dummy)

A matrix: 6 × 6 of type dbl
	Type.Large	Type.Midsize	Type.Small	Type.Sporty	Origin.non-USA
75	0	0	0	1	0
76	0	1	0	0	0
77	1	0	0	0	0
78	0	0	0	0	1
79	0	0	1	0	0
80	0	0	1	0	1

(3-2) Creating dummy variables manually with the ifelse() conditional function

Compared with using the caret package as in (3-1), the manual approach requires more and more code as the number of categorical variables and the number of classes within each variable grow. And when the categorical variables or their classes can change, hard-coded manual steps can cause errors or add cost when automating the data preprocessing workflow.

If a categorical variable has k categories (classes), I created dummy variables only for the first k-1 of them (to avoid the dummy variable trap in regression models).

# check level (class) of categorical variables

unique(X_train_cat$Type)

[Out]: Small Midsize Compact Large Sporty Van

unique(X_train_cat$Origin)

[Out]: non-USA USA

# get dummy variables from train set

X_train_cat_dummy <- data.frame(

type_small = ifelse(X_train_cat$Type == "Small", 1, 0)

, type_midsize = ifelse(X_train_cat$Type == "Midsize", 1, 0)

, type_compact = ifelse(X_train_cat$Type == "Compact", 1, 0)

, type_large = ifelse(X_train_cat$Type == "Large", 1, 0)

, type_sporty = ifelse(X_train_cat$Type == "Sporty", 1, 0)

, origin_nonusa = ifelse(X_train_cat$Origin == "non-USA", 1, 0)

)

head(X_train_cat_dummy)

A data.frame: 6 × 6
type_small	type_midsize	type_compact	type_large	type_sporty	origin_nonusa
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	0	0	0	0	1
0	1	0	0	0	1
0	0	1	0	0	1
0	1	0	0	0	1
0	1	0	0	0	1
0	1	0	0	0	0

# get dummy variables from test set

X_test_cat_dummy <- data.frame(

type_small = ifelse(X_test_cat$Type == "Small", 1, 0)

, type_midsize = ifelse(X_test_cat$Type == "Midsize", 1, 0)

, type_compact = ifelse(X_test_cat$Type == "Compact", 1, 0)

, type_large = ifelse(X_test_cat$Type == "Large", 1, 0)

, type_sporty = ifelse(X_test_cat$Type == "Sporty", 1, 0)

, origin_nonusa = ifelse(X_test_cat$Origin == "non-USA", 1, 0)

)

head(X_test_cat_dummy)

A data.frame: 6 × 6
type_small	type_midsize	type_compact	type_large	type_sporty	origin_nonusa
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
0	0	0	0	1	0
0	1	0	0	0	0
0	0	0	1	0	0
0	0	1	0	0	1
1	0	0	0	0	0
1	0	0	0	0	1
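As a middle ground between caret and hand-written ifelse() calls, base R's model.matrix() can also produce k-1 dummy columns per factor without hard-coding the level names. This is a swapped-in alternative, not the method used in this post, and the small cat_df data frame and its level order below are hypothetical. Note that it drops the first level of each factor as the baseline, whereas the manual code above drops the last class of Type.

```r
# Minimal sketch: k-1 dummy columns per factor via model.matrix() (base R).
cat_df <- data.frame(
  Type   = factor(c("Small", "Midsize", "Compact", "Midsize"),
                  levels = c("Compact", "Midsize", "Small")),
  Origin = factor(c("non-USA", "non-USA", "non-USA", "USA"),
                  levels = c("USA", "non-USA"))
)

# treatment contrasts give k-1 columns per factor; [, -1] drops the intercept
dummies <- model.matrix(~ Type + Origin, data = cat_df)[, -1, drop = FALSE]

colnames(dummies)   # "TypeMidsize" "TypeSmall" "Originnon-USA"
dummies[1, ]        # Small, non-USA -> 0 1 1
```

For the Train and Test matrices to end up with the same columns, both data frames need factors with the same levels, fixed from the Train set, before calling model.matrix().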

(4) Combining the preprocessed numeric and categorical variables to complete the Train and Test sets

# combine X_train_num_scaled, X_train_cat_dummy

X_train_preprocessed <- cbind(X_train_num_scaled, X_train_cat_dummy)

head(X_train_preprocessed)

A data.frame: 6 × 10
Price	Horsepower	RPM	Length	type_small	type_midsize	type_compact	type_large	type_sporty	origin_nonusa
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
0.1559633	0.3469388	0.9259259	0.4615385	1	0	0	0	0	1
0.4862385	0.5918367	0.6296296	0.6923077	0	1	0	0	0	1
0.3981651	0.4775510	0.6296296	0.5000000	0	0	1	0	0	1
0.5559633	0.4775510	0.6296296	0.6666667	0	1	0	0	0	1
0.4146789	0.6244898	0.7037037	0.5769231	0	1	0	0	0	1
0.1522936	0.2244898	0.5185185	0.6153846	0	1	0	0	0	0

# combine X_test_num_scaled, X_test_cat_dummy

X_test_preprocessed <- cbind(X_test_num_scaled, X_test_cat_dummy)

head(X_test_preprocessed)

A data.frame: 6 × 10
	Price	Horsepower	RPM	Length	type_small	type_midsize	type_compact	type_large	type_sporty	origin_nonusa
	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
75	0.18899083	0.42857143	0.2962963	0.70512821	0	0	0	0	1	0
76	0.20366972	0.59183673	0.4444444	0.69230769	0	1	0	0	0	0
77	0.31192661	0.46938776	0.3703704	0.46153846	0	0	0	1	0	0
78	0.39082569	0.34693878	0.8148148	0.55128205	0	0	1	0	0	1
79	0.06788991	0.12244898	0.4444444	0.44871795	1	0	0	0	0	0
80	0.01834862	0.07346939	0.6666667	0.06410256	1	0	0	0	0	1

I hope this post was helpful.

License: Attribution, NonCommercial, NoDerivatives

Posted by Rfriend
