R uniform distribution : unif()

R Analysis and Programming / R Statistical Analysis, 2015. 9. 19. 23:02

Continuous probability distributions include

- Normal distribution : norm()
- Uniform distribution : unif()
- Exponential distribution : exp()
- t-distribution : t()
- F-distribution : f()
- Chi-squared distribution : chisq()

among others. (Each name here is the suffix that follows a d/p/q/r prefix, as in dnorm(), punif(), qexp(), rt(); R's standalone exp() and t() functions are unrelated.)

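This post looks at the uniform distribution. The uniform distribution is the simplest of the continuous probability distributions: values are spread evenly over the interval [min=a, max=b], so every point in the interval is equally likely.

For example, suppose the total flight time from Gimpo Airport to Jeju Airport, from takeoff to landing, is between 1 hour and 1 hour 5 minutes. The probability that the plane arrives within the first 59 minutes 59 seconds is 0, the probability that it arrives between 1 hour and 1 hour 5 minutes is 1, and the probability that it arrives after 1 hour 5 minutes is 0 again. Within that 5-minute window every arrival time is equally likely, so the flight time can be said to follow a uniform distribution.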
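As a minimal sketch of the flight example, the probabilities can be checked with punif() and dunif(). It assumes the flight time is measured in minutes, so min=60 and max=65; these values are one reading of the example, not code from the original post.

```r
# Flight time T in minutes, assumed uniform on [60, 65]
p_before_60 <- punif(60, min = 60, max = 65)                                  # P(T <= 60)
p_within    <- punif(65, min = 60, max = 65) - punif(60, min = 60, max = 65)  # P(60 <= T <= 65)
p_61_to_63  <- punif(63, min = 60, max = 65) - punif(61, min = 60, max = 65)  # P(61 <= T <= 63)
dens        <- dunif(62, min = 60, max = 65)                                  # constant density 1 / (65 - 60)

print(c(p_before_60, p_within, p_61_to_63, dens))
```

Any sub-interval of length 2 minutes inside the window has the same probability, 2/5, which is what "uniform" means here.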

The uniform distribution functions and parameters available in R are listed below; pick the function and parameters you need.

Function type	Prefix	Uniform distribution function / parameters : unif()
Density function	d	dunif(x, min, max)
Cumulative distribution function	p	punif(q, min, max, lower.tail=TRUE/FALSE)
Quantile function	q	qunif(p, min, max, lower.tail=TRUE/FALSE)
Random number generation	r	runif(n, min, max)

(1) Uniform distribution plot : fun = dunif

Plotting the uniform density with ggplot2's stat_function(fun = dunif) shows that the density is constant over the interval [a, b].

 
> library(ggplot2)
> # uniform distribution plot (min=0, max=10)
> # uniform distribution : fun = dunif
> # (in ggplot2 3.4.0 and later, linewidth=1 replaces size=1 for lines)
> ggplot(data.frame(x=c(-2,20)), aes(x=x)) +
+   stat_function(fun=dunif, args=list(min = 0, max = 10), colour="black", size=1) +
+   ggtitle("Uniform Distribution of (min=0, max=10)")

(2) Cumulative uniform distribution plot : fun = punif

The cumulative uniform distribution plot is drawn as follows.

> # (2) Cumulative uniform distribution plot : fun = punif
> ggplot(data.frame(x=c(-2,20)), aes(x=x)) +
+   stat_function(fun=punif, args=list(min = 0, max = 10), colour="black", size=1) +
+   ggtitle("Cumulative Uniform Distribution of (min=0, max=10)")

(3) Computing probabilities with the cumulative uniform distribution function : punif()

> # (3) cumulative uniform distribution function, probability calculation : punif()
> # : punif(q, min, max, lower.tail = TRUE/FALSE)
> punif(3, min=0, max=10, lower.tail=TRUE)
[1] 0.3

> # Uniform Distribution of (min=0, max=10), x from 0 to 3
> ggplot(data.frame(x=c(-2,20)), aes(x=x)) +
+   stat_function(fun=dunif, args=list(min = 0, max = 10), colour="black", size=1) +
+   annotate("rect", xmin=0, xmax=3, ymin=0, ymax=0.1, alpha=0.2, fill="yellow") +
+   ggtitle("Uniform Distribution of (min=0, max=10), x from 0 to 3")

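(4) Computing the uniform quantile function : qunif(p, min, max, lower.tail=TRUE/FALSE)

The quantile function follows the same q-prefix pattern as qnorm() for the normal distribution in the previous post, but the parameters inside the parentheses differ.

(For reference: the normal distribution uses qnorm(p, mean, sd, lower.tail=T/F).)

> # (4) quantile function : qunif(p, min, max, lower.tail=TRUE/FALSE)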
> qunif(0.3, min=0, max=10, lower.tail = TRUE)
[1] 3
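The link between punif() and qunif() can be sketched as follows. The closed form (x - a) / (b - a) for the uniform CDF on [a, b] is standard; the variable names are illustrative only.

```r
a <- 0
b <- 10
x <- c(0, 2.5, 3, 7, 10)

cdf_manual <- (x - a) / (b - a)            # closed-form CDF on [a, b]
cdf_r      <- punif(x, min = a, max = b)   # same values from R

# qunif() undoes punif(): quantile of the cumulative probability gives x back
roundtrip <- qunif(cdf_r, min = a, max = b)

# lower.tail = FALSE gives the upper-tail probability P(X > 3)
upper_3 <- punif(3, min = a, max = b, lower.tail = FALSE)

print(cbind(x, cdf_manual, cdf_r, roundtrip))
print(upper_3)
```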

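(5) Random number generation : runif(n, min, max)

Random numbers change on every run, so you will get different numbers and a different graph from the ones shown below, but the overall shape will be similar and roughly uniform.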

> ru_100 <- runif(n=100, min=0, max = 10)
> ru_100

 [1] 7.33957568 2.78596723 6.30797744 5.01438337 6.57949706 5.90883342 3.51446293 9.28736811
  [9] 9.55213668 5.59377524 4.71003185 3.29525512 0.25759555 9.40326151 6.56466466 2.44973803
 [17] 4.88714900 3.10710648 3.84375758 8.55017741 3.09487276 0.13411621 0.44285713 8.90632265
 [25] 0.07968823 5.03465390 4.64601169 1.23565062 4.81310463 1.59225023 7.03799510 0.68870704
 [33] 4.03014086 9.97756283 5.55815726 2.01819345 7.00497545 8.50399118 2.29608430 2.92359120
 [41] 0.85656712 6.52544881 6.37193951 6.15247601 5.29502105 7.68988134 6.37691223 0.37387705
 [49] 6.89023959 1.65049129 3.75195268 7.97220092 6.50160025 9.52491436 1.70569894 9.80475205
 [57] 0.24770673 8.47412000 4.66718922 2.52269224 2.81985175 8.79845402 6.03852213 8.10848875
 [65] 1.10510449 9.35548906 1.83535387 0.47889795 6.54578585 1.61742080 4.51840400 3.99912651
 [73] 4.82545376 4.04589108 0.71750065 7.56085867 1.22887762 2.97822070 5.14541682 3.59126885
 [81] 5.00911758 1.02152702 7.78324707 4.69437196 1.13090493 3.70933500 0.03173870 5.74159309
 [89] 2.68879279 3.36398725 9.34593590 6.18818473 9.43490689 5.82578697 4.49576854 2.90029081
 [97] 3.34726356 7.19013351 9.97276521 9.39421932

> # density plot of runif(n=100, min=0, max = 10) & adding line of 0.1 uniform probability
> hist(ru_100, freq=FALSE, breaks=10, col="yellow")

> abline(h=0.1, lty=3, lwd=3, col="red")
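If the same random numbers are needed on every run, fixing the seed with set.seed() before calling runif() is one option. A minimal sketch follows; the seed value 123 is arbitrary and not from the original post.

```r
set.seed(123)                               # arbitrary seed, chosen only for illustration
ru_a <- runif(n = 100, min = 0, max = 10)

set.seed(123)                               # same seed again
ru_b <- runif(n = 100, min = 0, max = 10)

same_draws  <- identical(ru_a, ru_b)        # TRUE: the two samples match exactly
sample_mean <- mean(ru_a)                   # should be near (0 + 10) / 2 = 5

print(same_draws)
print(sample_mean)
```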

I hope this was helpful.

License: Attribution, NonCommercial, NoDerivatives

Posted by Rfriend

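Saving ggplot2 plots : ggsave(), saving console output as text : capture.output()

R Analysis and Programming / R Programming, 2015. 9. 17. 21:21

If you only need one or two one-off graphs, a screen-capture program or RStudio's file export is enough.

If you only run an analysis once or twice, you can simply select the results in the Console pane with the mouse and copy and paste them.

However, when you have to draw hundreds or thousands of graphs, save each one to a file, and automate the process (user-defined functions, loops), or when you have to run hundreds or thousands of analyses or models and store each result separately, doing this by hand every time is not feasible. It takes a long time, and manual steps invite mistakes.

This post therefore introduces

(1) how to save a graph drawn with ggplot2 as a jpg or pdf file

: ggsave()

(2) how to save analysis and modeling results shown in the Console pane to a text file

: capture.output()

Once these tasks can be done from an R script, they can also be automated through programming.

The example data is the Cars93 data frame in the MASS package. We fit linear regression models of highway fuel economy (MPG.highway) on variables such as weight (Weight), engine size (EngineSize), horsepower (Horsepower), length (Length), and width (Width), and export the fitted results to a text file.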

> # dataset : Cars93 dataframe,  Weight, MPG.highway variable
> library(MASS)
> str(Cars93)
'data.frame':	93 obs. of  27 variables:
 $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
 $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
 $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
 $ Min.Price         : num  12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
 $ Price             : num  15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
 $ Max.Price         : num  18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
 $ MPG.city          : int  25 18 20 19 22 22 19 16 19 16 ...
 $ MPG.highway       : int  31 25 26 26 30 31 28 25 27 25 ...
 $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
 $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
 $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
 $ EngineSize        : num  1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
 $ Horsepower        : int  140 200 172 172 208 110 170 180 170 200 ...
 $ RPM               : int  6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
 $ Rev.per.mile      : int  2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
 $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
 $ Fuel.tank.capacity: num  13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
 $ Passengers        : int  5 5 5 6 4 6 6 6 5 6 ...
 $ Length            : int  177 195 180 193 186 189 200 216 198 206 ...
 $ Wheelbase         : int  102 115 102 106 109 105 111 116 108 114 ...
 $ Width             : int  68 71 67 70 69 69 74 78 73 73 ...
 $ Turn.circle       : int  37 38 37 37 39 41 42 45 41 43 ...
 $ Rear.seat.room    : num  26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
 $ Luggage.room      : int  11 15 14 17 13 16 17 21 14 18 ...
 $ Weight            : int  2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
 $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
 $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...

Drawing a scatter plot of highway fuel economy (MPG.highway) against vehicle weight (Weight) with ggplot produces the graph in the Plots pane at the lower right of RStudio.

> # Scatter Plot of Weight & MPG.highway
> library(ggplot2)
> 
> ggplot(Cars93, aes(x=Weight, y=MPG.highway)) +
+   geom_point(shape=19, size=3, colour="blue") +
+   ggtitle("Scatter Plot of Weight & MPG.highway")

Now let's save it as a jpg file with ggsave(). To save it as a pdf file instead, use the pdf extension in place of jpg in the file name.

 
> # saving ggplot with jpg format file : ggsave()
> # (with no plot argument, ggsave() saves the last plot displayed)
> ggsave(filename="C:/Users/user/Documents/R/scatter_plot.jpg", # directory, filename
+        width=20, height=15, units="cm")  # width, height, units
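For scripted use it can be safer to store the plot in a variable and pass it to ggsave() explicitly, rather than relying on the last plot displayed. The sketch below saves a pdf this way. The output folder is tempdir() only so that the example runs on any machine, and the object names p and out_pdf are illustrative.

```r
library(MASS)      # Cars93 data frame
library(ggplot2)

# Keep the plot in an object instead of only printing it
p <- ggplot(Cars93, aes(x = Weight, y = MPG.highway)) +
  geom_point(shape = 19, size = 3, colour = "blue") +
  ggtitle("Scatter Plot of Weight & MPG.highway")

# tempdir() is a stand-in; substitute your own folder
out_pdf <- file.path(tempdir(), "scatter_plot.pdf")

# The .pdf extension selects the pdf device
ggsave(filename = out_pdf, plot = p, width = 20, height = 15, units = "cm")

print(file.exists(out_pdf))
```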

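Fitting a simple linear regression model and a multiple linear regression model (the latter then reduced by stepwise selection with stepAIC()) gives the following.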

> # linear regression modeling
> fit_1 <- lm(MPG.highway ~ Weight, Cars93)
> summary(fit_1)

Call:
lm(formula = MPG.highway ~ Weight, data = Cars93)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6501 -1.8359 -0.0774  1.8235 11.6172 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 51.6013654  1.7355498   29.73   <2e-16 ***
Weight      -0.0073271  0.0005548  -13.21   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.139 on 91 degrees of freedom
Multiple R-squared:  0.6572,	Adjusted R-squared:  0.6534 
F-statistic: 174.4 on 1 and 91 DF,  p-value: < 2.2e-16

> 
> 
> fit_2 <- lm(MPG.highway ~ Weight + EngineSize + Horsepower + Length + Width, Cars93)
> fit_3=stepAIC(fit_2, direction="both")
Start:  AIC=210.77
MPG.highway ~ Weight + EngineSize + Horsepower + Length + Width

             Df Sum of Sq     RSS    AIC
- Horsepower  1      1.38  789.67 208.93
- EngineSize  1      2.62  790.91 209.07
- Width       1      7.03  795.33 209.59
<none>                     788.30 210.77
- Length      1     44.23  832.53 213.84
- Weight      1    562.35 1350.65 258.84

Step:  AIC=208.93
MPG.highway ~ Weight + EngineSize + Length + Width

             Df Sum of Sq     RSS    AIC
- EngineSize  1      1.62  791.29 207.12
- Width       1      7.95  797.62 207.86
<none>                     789.67 208.93
+ Horsepower  1      1.38  788.30 210.77
- Length      1     48.74  838.41 212.50
- Weight      1    699.19 1488.87 265.90

Step:  AIC=207.12
MPG.highway ~ Weight + Length + Width

             Df Sum of Sq     RSS    AIC
- Width       1     13.71  805.00 206.72
<none>                     791.29 207.12
+ EngineSize  1      1.62  789.67 208.93
+ Horsepower  1      0.38  790.91 209.07
- Length      1     52.31  843.61 211.07
- Weight      1    749.41 1540.70 267.09

Step:  AIC=206.72
MPG.highway ~ Weight + Length

             Df Sum of Sq     RSS    AIC
<none>                     805.00 206.72
+ Width       1     13.71  791.29 207.12
+ EngineSize  1      7.38  797.62 207.86
+ Horsepower  1      0.21  804.79 208.69
- Length      1     91.62  896.62 214.74
- Weight      1   1039.48 1844.48 281.82
> summary(fit_3)

Call:
lm(formula = MPG.highway ~ Weight + Length, data = Cars93)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.0988 -1.8630 -0.2093  1.4199 11.3613 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.5217809  4.6998055   7.984 4.41e-12 ***
Weight      -0.0096328  0.0008936 -10.780  < 2e-16 ***
Length       0.1155263  0.0360972   3.200   0.0019 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.991 on 90 degrees of freedom
Multiple R-squared:  0.6922,	Adjusted R-squared:  0.6854 
F-statistic: 101.2 on 2 and 90 DF,  p-value: < 2.2e-16

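These results can be exported to a text file with capture.output(), appending each result to the same file so that they accumulate, as shown below. (The original post also included a screenshot of the resulting text file.)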

> # capture.output()
> # (1) Simple Linear Regression Model (y=MPG.highway, x=Weight)
> cat("\n", 
+     "\n",
+     "==============================================================", "\n", 
+     " [ Simple Linear Regression Model (y=MPG.highway, x=Weight)]  ", "\n", 
+     "==============================================================", "\n", 
+     file="C:/Users/user/Documents/R/lm_MPG_highway.txt", append = TRUE)
> 
> capture.output(summary(fit_1), 
+                file="C:/Users/user/Documents/R/lm_MPG_highway.txt", append = TRUE)
> 
> # (2) Multiple Linear Regression Model (y=MPG.highway, x1~x5)
> cat("\n", 
+     "\n",
+     "===============================================================", "\n", 
+     " [ Multiple Linear Regression Model (y=MPG.highway, x1~x5)]  ", "\n", 
+     "===============================================================", "\n", 
+     file="C:/Users/user/Documents/R/lm_MPG_highway.txt", append = TRUE)
> 
> capture.output(summary(fit_3), 
+                file="C:/Users/user/Documents/R/lm_MPG_highway.txt", append = TRUE)

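If you want to avoid tedious manual work, programming and automation are the answer.

If the work above has to be repeated, wrap it in a user-defined function and build the file-name part at the end of the file path with paste(), so that on each pass of the loop it is replaced by the value passed to the function.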
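The idea in the last paragraph could be sketched as below. The helper save_lm_summary() and its arguments are hypothetical names, not from the original post; paste0() is used as a no-separator variant of paste(), and tempdir() stands in for a real output folder.

```r
library(MASS)   # Cars93 data frame

# Hypothetical helper: fit MPG.highway ~ <x_var> and append the summary to a
# text file whose name is built from x_var
save_lm_summary <- function(x_var, out_dir = tempdir()) {
  out_file <- file.path(out_dir, paste0("lm_MPG_highway_", x_var, ".txt"))
  fit <- lm(reformulate(x_var, response = "MPG.highway"), data = Cars93)
  cat("\n [ Simple Linear Regression Model (y=MPG.highway, x=", x_var, ") ]\n",
      sep = "", file = out_file, append = TRUE)
  capture.output(summary(fit), file = out_file, append = TRUE)
  out_file   # return the path so the caller can check it
}

# One file per predictor; vapply() plays the role of the loop here
x_vars <- c("Weight", "EngineSize", "Horsepower", "Length", "Width")
out_files <- vapply(x_vars, save_lm_summary, character(1))

print(basename(out_files))
```

Because append = TRUE is kept from the original example, re-running the sketch adds to the existing files rather than overwriting them.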

I hope this was helpful.

R clearing of console, datasets, plots in RStudio

R Analysis and Programming / R Data Preprocessing, 2015. 9. 17. 21:17

This post shows how to clear the Console pane, the datasets in the Environment pane, and the Plots in RStudio, each with a single line of script.

Without the commands below, you would have to list every dataset name in rm() by hand. ^,^;

In RStudio you can clear Datasets or Plots all at once by clicking the broom icon, but if you want to build a clear-everything step into a program or a user-defined function, you need to be able to do what the broom icon does from a script.

(1) To clear the messages in the Console pane at the lower left

# clearing of console in Rstudio
cat("\014")

(2) To clear all Datasets in the Environment pane at the upper right

# clearing of datasets in Rstudio Environment
# (objects whose names start with a dot are not listed by ls() and are kept)
rm(list=ls())

(3) To clear all Plots in the Plots pane at the lower right

# clearing of plots in Rstudio
# (dev.off() raises an error when no plot device is open)
dev.off()

These commands do the job.
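When these commands are placed inside a user-defined function, note that a bare rm(list=ls()) acts on the function's own local environment, not on the global workspace. A hedged sketch of a wrapper that names the target environment explicitly is shown below; clear_session() is a hypothetical helper, and graphics.off() is swapped in for dev.off() because it does not fail when no device is open.

```r
# Hypothetical helper (not from the original post)
clear_session <- function(envir = globalenv(), clear_console = interactive()) {
  if (clear_console) cat("\014")               # form feed; RStudio treats it as "clear console"
  rm(list = ls(envir = envir), envir = envir)  # remove all non-hidden objects in envir
  graphics.off()                               # close every open graphics device, if any
  invisible(NULL)
}

# Demonstration on a scratch environment, so the global workspace is left alone
scratch <- new.env()
assign("a", 1, envir = scratch)
assign("b", "text", envir = scratch)
clear_session(envir = scratch, clear_console = FALSE)

print(ls(scratch))
```

Called with the default envir = globalenv(), the helper would also remove itself if it was defined in the global workspace, so it would have to be sourced again afterwards.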

R normal distribution : norm()

R Analysis and Programming / R Statistical Analysis, 2015. 9. 16. 00:49

Continuous probability distributions include

- Normal distribution : norm()
- Uniform distribution : unif()
- Exponential distribution : exp()
- t-distribution : t()
- F-distribution : f()
- Chi-squared distribution : chisq()

among others.

The normal distribution is the probability distribution underlying inferential statistics (estimation and hypothesis testing) and model fitting such as regression analysis. It is commonly encountered in everyday data, and by the Central Limit Theorem, as the sample size n increases, the distribution of the sample mean approaches a normal distribution whether the data come from a discrete distribution such as the binomial, hypergeometric, or Poisson, or from a continuous distribution such as the t or F (provided the variance is finite). The normal distribution is therefore one of the most important and most widely used distributions in statistics.

Using R, we will work through examples of plotting the normal distribution, the probability density function, the cumulative distribution function, the quantile function, and random number generation.

Function type	Prefix	R function / parameters : norm()
Density function	d	dnorm(x, mean=0, sd=1)
Cumulative distribution function	p	pnorm(q, mean=0, sd=1, lower.tail=TRUE/FALSE)
Quantile function	q	qnorm(p, mean=0, sd=1, lower.tail=TRUE/FALSE)
Random number generation	r	rnorm(n, mean=0, sd=1)

(1) Normal distribution plot : plot(x, dnorm(x))

> # (1) Normal distribution plot, X~N(0,1)
> 
> x <- seq(-3, 3, length.out=200)
> plot(x, dnorm(x, mean=0, sd=1), type='l', main="Normal distribution, X~N(0,1)")

The graph above is the (standard) normal density curve; the one below is the cumulative normal distribution curve.

> # Cumulative normal distribution plot, X~N(0,1)
> x <- seq(-3, 3, length.out=200)
> plot(x, pnorm(x, mean=0, sd=1), type='l', main="Cumulative normal distribution, X~N(0,1)")

(2) Computing values of the normal cumulative distribution function : pnorm()

> # (2) values of the normal cumulative distribution function : pnorm()

> # P(-1 <= X <= +1)
> pnorm(q=c(1), mean=0, sd=1) # 0.8413447
[1] 0.8413447
> pnorm(q=c(-1), mean=0, sd=1) # 0.1586553
[1] 0.1586553
> 
> pnorm(q=c(1), mean=0, sd=1) - pnorm(q=c(-1), mean=0, sd=1) # 0.6826895
[1] 0.6826895
>

> # P(-2 <= X <= +2)
> pnorm(q=c(2), mean=0, sd=1) # 0.9772499
[1] 0.9772499
> pnorm(q=c(-2), mean=0, sd=1) # 0.02275013
[1] 0.02275013
> 
> pnorm(q=c(2), mean=0, sd=1) - pnorm(q=c(-2), mean=0, sd=1) # 0.9544997
[1] 0.9544997
>

> # P(-3 <= X <= +3)
> pnorm(q=c(3), mean=0, sd=1) # 0.9986501
[1] 0.9986501
> pnorm(q=c(-3), mean=0, sd=1) # 0.001349898
[1] 0.001349898
> 
> pnorm(q=c(3), mean=0, sd=1) - pnorm(q=c(-3), mean=0, sd=1) # 0.9973002
[1] 0.9973002
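Since pnorm() is vectorized, the three interval probabilities above can be computed in one call. The sketch below also repeats the calculation for a non-standard normal; mean 100 and sd 15 are made-up values used only to show that the probabilities within k standard deviations do not depend on the parameters.

```r
k <- 1:3

# P(-k <= Z <= k) for the standard normal, k = 1, 2, 3
within_k_sd <- pnorm(k) - pnorm(-k)

# Same probabilities for X ~ N(mu, sigma^2); mu and sigma are illustrative
mu <- 100
sigma <- 15
within_k_sd_x <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
  pnorm(mu - k * sigma, mean = mu, sd = sigma)

print(round(within_k_sd, 7))
print(round(within_k_sd_x, 7))
```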

> # lower.tail=FALSE

> pnorm(q=c(1), mean=0, sd=1, lower.tail = TRUE)
[1] 0.8413447
> pnorm(q=c(1), mean=0, sd=1, lower.tail = FALSE)
[1] 0.1586553

With pnorm(q, mean, sd, lower.tail = TRUE), the result is the area to the left of the quantile q, from -inf up to q; with pnorm(q, mean, sd, lower.tail = FALSE), it is the area to the right, from q to +inf.

(3) Quantile function : qnorm(p, mean=0, sd=1, lower.tail=TRUE/FALSE)

The quantile function qnorm() is used when you want the quantile q that corresponds to a given cumulative probability p for a normally distributed population. The quantile function is the inverse of the cumulative distribution function, and vice versa. The example below illustrates this inverse relationship.

> # (3) quantile function : qnorm(p, mean=0, sd=1, lower.tail=TRUE/FALSE)
> pnorm(q=c(1), mean=0, sd=1) # cumulative distribution function
[1] 0.8413447
> 
> # 0.9999998 rather than 1 because p was typed in rounded to 7 digits
> qnorm(p=0.8413447, mean=0, sd=1, lower.tail = TRUE) # quantile function
[1] 0.9999998

> 
> qnorm(pnorm(1))
[1] 1
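The inverse relationship can be checked over several points at once, as in this short sketch. The vector x and the variable names are illustrative; all.equal() is used because floating-point round trips are exact only up to a small tolerance.

```r
x <- c(-2, -1, 0, 1, 2)

# qnorm() applied to pnorm() returns the original points
roundtrip <- qnorm(pnorm(x))
roundtrip_ok <- isTRUE(all.equal(roundtrip, x))

# The same holds for the upper tail when lower.tail = FALSE is used on both sides
upper_roundtrip <- qnorm(pnorm(1, lower.tail = FALSE), lower.tail = FALSE)

# A p value rounded down to 7 digits maps to a quantile slightly below 1
q_from_rounded_p <- qnorm(0.8413447)

print(roundtrip_ok)
print(upper_roundtrip)
print(q_from_rounded_p)
```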

(4) Normal random number generation : rnorm()

> # 100s random sampling from normal distribution X~N(0,1)

> random_norm_100 <- rnorm(100, mean=0, sd=1)

> random_norm_100
  [1] -0.49990947  1.30162824 -0.55303626 -0.67823807  1.09867201  0.10112825  1.60729584
  [8] -0.49131533 -0.23875557 -0.10318560  0.37495367  2.37449966  0.17832867 -1.13884498
 [15] -0.04055883 -0.64884566 -0.77738880 -1.07587347 -0.64434199 -1.38282292  0.16584547
 [22]  0.44776193 -0.78980486  1.73319388 -0.57968848  1.25727796 -0.05320889  2.61784767
 [29]  0.78992548  0.42473023 -1.45674849  1.45782133  2.58232132 -1.85544752 -0.46611618
 [36]  0.54686807 -0.72847864  0.12996224  0.19426881  0.01652534 -0.03819245 -0.60196303
 [43] -1.33088212  0.33449997  0.08826498 -0.12122490  0.45268734 -0.27621040  0.65957252
 [50]  0.73278278  1.23812581  0.09450144  1.44667268 -0.71373007 -0.04135331  1.07079267
 [57]  0.85465336  0.10066264  0.07047791 -0.19465235  1.83187324 -0.06047522 -0.89295237
 [64] -1.35422679 -0.26235751  1.06455750  0.83675769 -0.16588313 -0.77936548  0.16614752
 [71]  0.18333754  0.25274271  1.24194101 -0.36543022 -1.25669837  0.16981720 -0.83342688
 [78]  2.58352657 -2.00730559  0.03383145  0.44008506  0.60350848  1.12223002  0.38470856
 [85] -1.06631289 -0.08023159  0.28374720  1.68415043 -0.06373142  0.25866477  0.04997717
 [92]  0.47737531 -1.07703969  0.25487228 -1.00018975 -0.81282824 -0.77747525 -0.44254534
 [99] -0.56190014  0.67733634

> hist(random_norm_100)

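Random numbers change on every run, so if you follow the example you will probably see different numbers and a different histogram from those shown above, but the shape should similarly resemble a normal distribution, roughly symmetric around the mean of 0.

I hope this was helpful.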

R Poisson distribution : pois()

R Analysis and Programming / R Statistical Analysis, 2015. 9. 15. 00:49

Discrete probability distributions include

- Binomial distribution : binom()
- Hypergeometric distribution : hyper()
- Poisson distribution : pois()

among others. Previous posts covered the binomial and hypergeometric distributions; this post covers the Poisson distribution.

The Poisson distribution is named after the 19th-century French mathematician S.D. Poisson. Because binomial probabilities were very hard to compute by hand, he devised a probability function that uses an exponential expression to approximate them, and that function is the Poisson distribution. (Today a computer evaluates any probability distribution in the blink of an eye, but it must have been hard work in the 19th century. ^^')

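When a random variable X follows the binomial distribution B(n, p), with np = λ held constant, n sufficiently large, and p close to 0, the binomial distribution is approximated by the Poisson distribution:

$$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots$$

* Note : e is the base of the natural logarithm, the irrational number 2.718281828459....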
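The approximation can be checked numerically by comparing dbinom() and dpois() for the same λ = np. The sketch below uses λ = 3 with n = 10 and n = 1000; these values are chosen only for illustration.

```r
lambda <- 3
x <- 0:10

# Small n (p = 0.3 is not close to 0): the approximation is rough
diff_small_n <- max(abs(dbinom(x, size = 10, prob = lambda / 10) - dpois(x, lambda)))

# Large n (p = 0.003 is close to 0): the two distributions nearly coincide
diff_large_n <- max(abs(dbinom(x, size = 1000, prob = lambda / 1000) - dpois(x, lambda)))

print(c(diff_small_n = diff_small_n, diff_large_n = diff_large_n))
```

The largest pointwise gap shrinks as n grows with np held at 3, which is the limiting behaviour described above.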

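The Poisson distribution is a discrete probability distribution used when events occur at random within a fixed unit of time or space. Examples include the number of customers visiting a bank in an hour, the number of calls to a call center in an hour, the number of traffic accidents on the Gyeongbu Expressway in a month, the number of airplane accidents in a year, the number of typos per page of a book, and the number of defects per 25 semiconductor wafers.

(Note that the exponential distribution, a continuous distribution, describes the interval between one event and the next; take care not to confuse the two.)

The parameter λ (pronounced lambda) of the Poisson distribution is the average number of events occurring at random in a fixed unit of time or space.

The R functions and parameters for the Poisson distribution are shown in the table below.

Function type	Prefix	Poisson distribution R function / parameters : pois()
Density (probability mass) function	d	dpois(x, lambda)
Cumulative distribution function	p	ppois(q, lambda, lower.tail = TRUE/FALSE)
Quantile function	q	qpois(p, lambda, lower.tail = TRUE/FALSE)
Random number generation	r	rpois(n, lambda)

Now let's go through examples one by one.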

(1) Poisson distribution plot with λ = 3 (Poisson distribution plot of lambda = 3)

> # (1) Poisson distribution plot
> # (x is passed explicitly so the horizontal axis shows the counts 0 to 10)
> plot(x = 0:10, y = dpois(x = 0:10, lambda = 3), 
+      type='h',
+      main = "Poisson distribution, lambda = 3")

(2) Computing P(X = 15) : dpois(x, lambda)

Problem) The number of customers visiting a bank per hour follows a Poisson distribution with λ = 20. What is the probability that exactly 15 customers visit in an hour?

> # (2) P(X = 15) in Poisson distribution with lambda = 20

> dpois(x=15, lambda = 20)

[1] 0.05164885

(3) Computing P(X <= 15) : ppois(q, lambda, lower.tail = TRUE)

Problem) The number of customers visiting a bank per hour follows a Poisson distribution with λ = 20. What is the probability that 15 or fewer customers visit in an hour?

> # (3) P(X <= 15) in Poisson distribution with lambda = 20

> ppois(q=15, lambda = 20, lower.tail = TRUE)

[1] 0.1565131

 
> sum(dpois(x=c(0:15), lambda = 20)) # the same result with the ppois()

[1] 0.1565131

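(4) Computing the quantile for a given probability : qpois(p, lambda, lower.tail=TRUE)

Problem) The number of customers visiting a bank per hour follows a Poisson distribution with λ = 20. If the cumulative probability P(X <= x) for the number of customers in an hour is 15.65131%, how many customers x does this correspond to?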

> qpois(p=0.1565131, lambda = 20, lower.tail = TRUE)

[1] 15
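Because the Poisson distribution is discrete, qpois() returns the smallest count x whose cumulative probability reaches p, so a whole range of p values maps to the same x. A small sketch, with probabilities picked only for illustration:

```r
lambda <- 20

# Cumulative probabilities at 14 and 15 customers
cum_14_15 <- ppois(14:15, lambda)

# Any p above P(X <= 14) and up to P(X <= 15) gives 15; p = 0.10 falls in the step for 14
q_015 <- qpois(0.15, lambda)
q_010 <- qpois(0.10, lambda)

print(cum_14_15)
print(c(q_015 = q_015, q_010 = q_010))
```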

(5) Random number generation : rpois(n, lambda)

Problem) Generate n = 1000 random numbers from a Poisson distribution with λ = 20, build a frequency table, and draw a bar chart of the frequencies.

> rpois(n=1000, lambda = 20)
   [1] 18 17 21 16 19 25 18 22 18 24 28 23 21 11 19 25 20 27 24 27 12 17 11 16 17 18 21 17 16 22 16
  [32] 20 24 18 26 15 20 17 25 18 16 23 18 17 20 28 18 16 21 18 21 20 16 21 22 11 20 18 20 10 15 17
  [63] 14 15 22 20 16 26 18 25 14 11 22 24 23 19 26 12 17 23 17 24 21 17 19 24 28 26 18 24 17 19 18
  [94] 19 24 22 23 20 25 21 22 16 20 24 20 22 24 25 22 23 20 19 28 19 21 15 27 27 17 14 20 25 26 25
 [125] 26 16 22 16 22 21 15 15 21 19 29 15 23 21 23 31 16 33 18 21 24 28 34 25 19 24 22 23 30 27 21
 [156] 20 16 18 18 13 21 20 23 21 15 12 18 25 16 15 26 18 22 18 10 26 23 19 13 18 22 23 21 22 12 20
 [187] 20 19 17 18 18 15 20 11 25 21 20 20 20 22 19 31 18 23 16 18 21 29 19 20 20 16 22 18 16 22 18
 [218] 14 18 23 18 22 15 15 14 19 20 23 11 20 17 21 23 17 21 12 28 22 19 16 20 14 27 20 26 19 22 19
 [249] 22 21 20 24 21 23 25 14 19 18 22  8 20 13 19 22 22 20 18 15 22 22 13 20 24 18 25 18 19 30 22
 [280] 18 30 22 10 25 21 18 19 19  7 25 15 27 23 26 21 16 21 19 21 24 16 18 18 21 25 10 15 25 18 21
 [311] 18 27 15 26 21 33 13 20 18 22 27 16 11 18 18 26 20 28 20 17 22 19 24 25 13 13 16 21 21 21 20
 [342] 18 21 18 16 11 15 24 19 24 31 23 24 17 21 20 19 16 20 23 27 18 23 20 14 27 14 26 15 14 10 23
 [373] 22 29 20 17 24 29 26 17 15 16 23 27 14 17 21 21 14 22 27 16 22 19 19 15 25 20 23 16 20 16 16
 [404] 20 18 18 16 31 15 13 15 16 18 17 23 19 20 18 24 13 24 20 25 22 15 17 25 12 11 19 16 19 26 29
 [435] 23 18 17 15 23 18 32 23 30 21 19 24 24 21 17 22 23 27 21 23 17 20 20 22 15 15 21 32 23 24 16
 [466] 28 18 23 24 22 20 18 19 18 15 16 16 20 17 16 12 18 25 23 21 17 19 21 24 20 16 20 26 19 21 28
 [497] 25 16 14 19 16 19 25  9 12 20 20 18 23 27 19 12  9 21 15 27 17 21 23 18 17 11 23 20 19 25 18
 [528] 17 19 22 18 28 25 25 17 26 30 26 17 22 16 22 16 31 15 25 16 23 15 23 15 20 20 18 21 19 15 23
 [559] 24 23 21 21 14 15 20 23 29 19 15 18 18 12 19 17 22 33 24 19 10 26 22 24 23 25 14 16 19 18 21
 [590] 19 32 18 20 22 23 16 18 22 16 25 14 19 23 19 14 17 24 26 19 18 27 20 21 19 21 27 20 30 17 26
 [621] 23 27 23 23 24 26 21 21 13 21 20 22 16 23 22 27 16 22 25 26 14 32 27 34 19 18 23 19 19 17 15
 [652] 29 15 19 18 16 19 19 15 16 19 28 18 19 17 14 19 23 25 31 24 24 14 19 17 19 25 20 24 21 17 20
 [683] 20 25 24 18 16 22 20 18 16 22 17 12 20 25 21 39 22 19 12 12 25 18 31 23 15 20 20 23 15 23 15
 [714] 22 17 14 14 13 16 17 22 18 14 26 28 21 17 24 26 26 17 14 15 24 26 11 25 31 20 24 19 27 19 30
 [745] 18 16 24 14 23 19 19 24 18 19 18 19 16 21 18 18 14 19 12 20 27 23 20 25 31 17 17 24 20 32 14
 [776] 29 19 26 22 21 21 17  9 19 23 23 19 28 15 19 17 19 26 20 23 19 18 20 14 21 14 22 16 16 12 25
 [807] 23 14 13 18 19 22 16 21 25 21 24 13 20 21 24 20 21 35 15 23 16 12 25 16 16 18 28 18 27 19 18
 [838] 19 27 25 23 16 26 16 17 17 21 12 20 26 18 22 15 26 16 21 20 16 13 25 14 16 13 23 19 12 23 21
 [869] 17 16 17 29 24 16 15 14 17 24 25 22 23 23 24 22 22 26 20 21 18 26  8 20 18 18 22 21 23 23 19
 [900] 16 13 23 14 17 20 23 18 20 19 23 22 21 19 19 17 15 22 26 22 17 18 29 14 16 26 21 19 17 15 21
 [931] 26 19 23 23 18 23 15 15 24 22 25 16 18 19 13 18 25 19 22 15 18 20 28 15 24 20 17 21 20 23 17
 [962] 22 18 25 19 21 21 22 21 18 18 21 17 26 16 23 25 27 33 30 20 22 24 17 14 21 21 24 20 24 23 14
 [993] 20 29 12 25 18  6 14 22

> table(rpois(n=1000, lambda = 20))

 7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 
 1  2  3  6 13 17 33 31 45 74 76 77 87 95 86 73 65 59 37 36 28 16 14 12  4  8  2

> plot(table(rpois(n=1000, lambda = 20)))

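Since λ = 20, the graph shows the counts piling up highest around the mean of 20, with a longer tail to the right, as expected for a Poisson distribution. Note that each rpois() call above draws a fresh sample, so the printed values, the frequency table, and the plot come from three different samples.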
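To have the printed values, the frequency table, and the plot all describe the same draws, the sample can be stored in a variable first. A minimal sketch follows; the seed 2015 and the names rp and freq are arbitrary choices, not from the original post.

```r
set.seed(2015)                      # arbitrary seed so the run can be repeated
rp <- rpois(n = 1000, lambda = 20)  # draw once, reuse below

freq <- table(rp)                   # frequency table of the stored sample
plot(freq, main = "rpois(n=1000, lambda=20)", xlab = "count", ylab = "frequency")

# For a Poisson distribution both the mean and the variance equal lambda
print(c(mean = mean(rp), var = var(rp)))
```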

For the Poisson regression model for count data, and for alternative models that can be used when there is overdispersion or an excess of zeros, see https://rfriend.tistory.com/490.

I hope this was helpful.

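R hypergeometric distribution : dhyper(x, m, n, k)

R Analysis and Programming / R Statistical Analysis, 2015. 9. 13. 23:50

Discrete probability distributions include

- Binomial distribution : binom()
- Hypergeometric distribution : hyper()
- Poisson distribution : pois()

among others.

The previous post covered the binomial distribution, a discrete probability distribution based on Bernoulli trials. In the binomial setting, a sample is drawn from the population, put back into the population (in other words, 'replaced'), and then another sample is drawn. As a result, the success probability p stays the same on every draw.

But if a drawn sample is not put back into the population (in other words, 'not replaced'), the process can no longer be called a sequence of Bernoulli trials, and the probability changes with every draw. As a simple example, take a bag with 5 red balls and 5 blue balls. At first the probability of drawing a red ball is 5/10 = 0.5, but if the first ball drawn is red, the probability that the second ball is red changes to 4/9 = 0.4444 (because the first red ball was not put back into the bag).

When n samples are drawn without replacement from a population of size N with success proportion p, and X is the number of successes, X is said to "follow a hypergeometric distribution with parameters N, n, p".

If the success proportion p is held fixed and the population size N grows without bound, it hardly matters whether sampling is with or without replacement, so the hypergeometric distribution approaches the binomial distribution. In the example above, if the bag held 500 million red balls and 500 million blue balls and 5 balls were drawn, then after a red ball on the first draw, replacing it or not has almost no effect on the probability of a red ball on the second draw.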

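So, for a population whose members take only two values, success (1) or failure (0): when N is small, use the binomial distribution for sampling with replacement and the hypergeometric distribution for sampling without replacement. When N is sufficiently large relative to the sample size, the hypergeometric distribution is well approximated by the binomial, and either one gives practically the same answer.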
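The red-and-blue-ball example can be checked numerically. The sketch below compares dhyper() with dbinom() for a bag of 5 + 5 balls and for a bag of 500 million + 500 million balls, drawing 5 in both cases; the variable names are illustrative.

```r
x <- 0:5   # number of red balls among the 5 drawn

binom     <- dbinom(x, size = 5, prob = 0.5)        # with replacement
small_pop <- dhyper(x, m = 5,   n = 5,   k = 5)     # without replacement, N = 10
large_pop <- dhyper(x, m = 5e8, n = 5e8, k = 5)     # without replacement, N = 1e9

diff_small <- max(abs(small_pop - binom))
diff_large <- max(abs(large_pop - binom))

print(round(rbind(binom, small_pop, large_pop), 5))
print(c(diff_small = diff_small, diff_large = diff_large))
```

With N = 10 the two distributions differ visibly, while with N = 1e9 the gap is negligible.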

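The R functions and parameters for the density function, cumulative distribution function, quantile function, and random number generation of the hypergeometric distribution are as follows.

Function type	Prefix	Hypergeometric distribution R function / parameters
Density (probability mass) function	d	dhyper(x, m, n, k)
Cumulative distribution function	p	phyper(q, m, n, k, lower.tail = TRUE/FALSE)
Quantile function	q	qhyper(p, m, n, k, lower.tail = TRUE/FALSE)
Random number generation	r	rhyper(nn, m, n, k)

* Note: the population consists of m items of one kind (successes) and n items of the other kind (failures), from which k items are drawn

lower.tail = TRUE means the left tail, P(X <= x)

An example hypergeometric distribution plot looks like this.

> # (x is passed explicitly so the horizontal axis shows the counts 0 to 20)
> plot(x = 0:20, y = dhyper(x=c(0:20), m=5, n=20, k=5), 
+      type='h', 
+      main = "Hypergeometric distribution, with m=5, n=20, k=5")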

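(1) Computing P(X = 4) : dhyper(x, m, n, k)

Problem) A barista claimed to be able to tell, just by smelling an Americano, whether or not it was made with "Colombian beans". So an experiment was run with that barista: 5 Americanos made with "Colombian beans" (m=5) and 20 Americanos made with beans from other regions (n=20) were prepared, and the barista was asked to pick out the 5 made with "Colombian beans". What is the probability of picking exactly 4 Americanos made with "Colombian beans"?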

> dhyper(x=4, m=5, n=20, k=5)

[1] 0.001882176

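This is sampling without replacement, so it follows a hypergeometric distribution. Of 25 Americanos in total, 5 were made with "Colombian beans", and the probability of picking 4 of them by chance is about 0.188%, so such a result would be hard to attribute to luck. A barista who managed it would deserve credit for a very sensitive nose.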
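When judging whether a result could be due to chance, it is common to look at the probability of a result at least this good, P(X >= 4), rather than exactly 4. This is an addition to the original example; a minimal sketch using the upper tail of phyper():

```r
# P(X = 4): exactly 4 of the 5 picks are Colombian
p_exactly_4 <- dhyper(x = 4, m = 5, n = 20, k = 5)

# P(X >= 4) = P(X > 3): upper tail beyond 3
p_at_least_4 <- phyper(q = 3, m = 5, n = 20, k = 5, lower.tail = FALSE)

# Same value by summing the point probabilities for 4 and 5
p_check <- sum(dhyper(x = 4:5, m = 5, n = 20, k = 5))

print(c(p_exactly_4 = p_exactly_4, p_at_least_4 = p_at_least_4, p_check = p_check))
```

The two probabilities are close here because picking all 5 correctly is far rarer still, so the conclusion in the text is unchanged.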

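To aid understanding, here is one more problem.

Problem 2) Of 100 TVs produced by a TV manufacturer and held in a warehouse, 95 are in good condition and 5 are defective. If 10 TVs are drawn from the warehouse without replacement, what is the probability that exactly 3 of them are defective?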

> dhyper(x=3, m=5, n=95, k=10)

[1] 0.006383528

(2) Computing P(X <= 4) : phyper(q, m, n, k, lower.tail=TRUE)

> # (2) computing P(X <= 4) : phyper(q, m, n, k, lower.tail=TRUE)

> phyper(q=4, m=5, n=20, k=5, lower.tail=TRUE)

[1] 0.9999812

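Without phyper(), you could (not a recommended approach, but as a brute-force check) compute the individual probabilities for X = 0, 1, 2, 3, 4 with dhyper() and sum() them; the result is the same. Treat this only as an aid to understanding the concept.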

> dhyper(x=0, m=5, n=20, k=5)

[1] 0.2918125


> dhyper(x=1, m=5, n=20, k=5)

[1] 0.4559571

> dhyper(x=2, m=5, n=20, k=5)

[1] 0.214568

> dhyper(x=3, m=5, n=20, k=5)

[1] 0.03576134

> dhyper(x=4, m=5, n=20, k=5)

[1] 0.001882176
>

> sum(dhyper(x=c(0:4), m=5, n=20, k=5))

[1] 0.9999812

(3) Finding the quantile for a given probability : qhyper(p, m, n, k, lower.tail = TRUE/FALSE)

 
> dhyper(x=3, m=5, n=20, k=5)
 
[1] 0.03576134
> 

> qhyper(p=0.03576134, m=5, n=20, k=5, lower.tail=F)
 
[1] 3
 
#-------------------------------
 
> phyper(q=3, m=5, n=20, k=5, lower.tail = T)
 
[1] 0.998099
> 
 
> qhyper(p=0.998099, m=5, n=20, k=5, lower.tail=T)
 
[1] 3

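In the first example above, dhyper(x=3, m=5, n=20, k=5) = 0.03576134. That is, when 5 items are drawn without replacement from a population with m = 5 and n = 20, the probability that X equals 3 is 0.03576134.

Note, however, that qhyper() inverts a cumulative (tail) probability, not the point probability returned by dhyper(). With lower.tail=FALSE it returns the smallest x for which P(X > x) <= p. Here P(X > 2) is about 0.0377 and P(X > 3) is about 0.0019, so qhyper(p=0.03576134, m=5, n=20, k=5, lower.tail=FALSE) = 3. The agreement with x = 3 is a coincidence of these particular numbers; feeding a dhyper() value into qhyper() does not recover x in general.

In the second example, phyper(q=3, m=5, n=20, k=5, lower.tail = T) = 0.998099 is a cumulative probability, because phyper() was used. The quantile for such a cumulative probability is obtained with qhyper(p, m, n, k, lower.tail=TRUE), that is, with the lower.tail=TRUE option.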
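The distinction can be seen in a short sketch: qhyper() is the inverse of phyper(), whereas passing a dhyper() value to qhyper() does not generally return the original x. The choice of x = 0 as a counterexample is mine, not from the original post.

```r
# Inverse of the cumulative probability: the rounded value from the example gives back 3
q_from_cum <- qhyper(p = 0.998099, m = 5, n = 20, k = 5, lower.tail = TRUE)

# Point probability of X = 0 fed into qhyper() with lower.tail = FALSE
p_point_0 <- dhyper(x = 0, m = 5, n = 20, k = 5)
q_from_point <- qhyper(p = p_point_0, m = 5, n = 20, k = 5, lower.tail = FALSE)

# Upper-tail probabilities that explain the result: P(X > 0) and P(X > 1)
upper_tail <- phyper(q = 0:1, m = 5, n = 20, k = 5, lower.tail = FALSE)

print(q_from_cum)
print(c(p_point_0 = p_point_0, q_from_point = q_from_point))
print(upper_tail)
```

Here P(X > 0) exceeds the point probability of X = 0 while P(X > 1) does not, so qhyper() returns 1 rather than 0.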

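(4) Random number generation : rhyper(nn, m, n, k)

Let's simulate drawing 5 items without replacement from a hypergeometric distribution with m=5, n=20, repeat it 1000 times, and then build a frequency table.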

> random_hyper <- rhyper(1000, m=5, n=20, k=5)

> random_hyper

   [1] 1 0 0 0 0 0 1 0 3 1 0 1 1 1 1 2 0 3 0 1 2 2 1 2 0 1 1 1 3 0 1 1 0 2 1 2 1 0 0 1 2 0 1 2 0 2
  [47] 1 2 1 1 1 1 2 1 1 2 2 1 1 2 1 0 1 1 1 3 1 0 1 1 0 0 1 1 0 1 2 0 0 0 1 1 0 1 1 0 1 0 0 1 1 1
  [93] 2 1 2 1 0 0 0 1 1 3 1 1 2 0 1 2 1 1 0 0 1 2 0 1 2 1 1 0 1 1 1 0 1 1 2 0 2 0 1 1 0 0 1 2 2 1
 [139] 0 2 2 1 3 0 2 2 1 1 0 3 0 0 1 1 1 0 0 2 2 0 3 0 3 2 0 0 0 2 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 0
 [185] 1 2 1 0 2 0 1 1 1 0 2 2 1 1 0 1 1 0 0 0 2 1 3 1 2 2 1 2 1 1 1 1 2 0 1 1 0 0 2 0 2 1 0 1 1 2
 [231] 2 2 0 3 1 1 1 2 1 3 2 1 2 1 2 0 0 1 0 0 1 2 2 2 1 1 1 4 1 2 0 2 1 1 1 1 1 1 1 1 0 1 0 2 0 0
 [277] 1 1 0 2 1 0 3 1 0 0 1 2 2 2 0 2 1 1 0 3 1 2 1 2 0 1 1 1 2 1 0 1 2 1 0 0 3 0 2 2 0 1 1 0 1 0
 [323] 2 2 1 1 3 1 1 0 1 2 0 0 0 0 0 3 2 0 2 2 1 1 1 2 1 0 1 1 2 0 0 1 2 0 0 2 1 2 1 2 2 2 1 1 0 1
 [369] 1 2 1 2 1 0 1 0 1 1 1 0 2 2 1 0 2 0 0 0 1 0 2 1 1 2 1 1 0 1 0 1 2 1 0 1 1 0 3 2 2 3 0 1 0 1
 [415] 1 0 0 0 2 2 1 2 1 1 1 0 2 2 2 0 1 2 3 2 1 1 0 1 1 1 1 0 1 2 3 1 0 1 2 2 0 1 1 2 1 0 2 1 0 1
 [461] 1 1 1 0 0 0 1 1 2 0 1 2 1 0 1 1 1 0 1 3 2 0 0 1 1 1 1 3 0 0 3 2 0 2 0 1 1 0 0 2 1 0 1 0 1 1
 [507] 0 1 1 2 1 1 1 1 1 2 0 0 1 2 0 1 0 1 0 1 1 1 1 0 1 3 0 1 0 1 2 3 1 1 1 2 0 2 1 1 0 0 2 2 2 0
 [553] 1 1 1 0 1 0 1 0 1 1 0 2 0 0 1 1 2 1 1 0 0 0 1 1 1 1 3 2 0 2 0 2 0 2 1 0 0 1 2 1 1 2 1 0 0 2
 [599] 0 0 0 1 1 2 1 1 2 2 1 0 1 3 1 2 2 2 2 1 0 2 1 4 3 3 0 3 1 2 2 1 1 2 1 2 3 2 2 0 1 1 2 3 1 0
 [645] 1 1 2 1 1 1 1 1 1 1 2 1 1 2 0 2 2 1 2 0 1 0 3 2 0 2 1 0 1 1 0 0 0 1 1 0 3 0 2 1 1 1 2 1 1 0
 [691] 1 0 1 1 0 2 2 1 1 2 0 0 0 2 0 1 1 3 0 0 1 2 0 0 0 2 1 2 0 1 1 1 0 2 2 1 3 0 1 3 2 0 1 2 1 1
 [737] 1 0 1 1 2 1 1 1 2 1 1 1 0 2 0 1 1 1 0 2 0 1 1 0 1 1 1 1 1 0 0 0 2 0 0 1 2 1 2 0 1 1 1 0 0 0
 [783] 1 1 1 0 0 0 3 0 2 1 2 1 1 1 0 1 1 1 0 1 0 0 2 0 1 4 0 1 0 1 2 0 2 0 1 1 1 1 1 0 2 2 1 1 2 1
 [829] 0 0 0 1 2 2 1 2 1 1 1 0 2 1 1 1 1 1 3 1 1 2 1 2 1 3 1 0 1 1 0 3 0 0 1 1 2 1 0 0 0 0 1 1 1 2
 [875] 3 0 1 0 0 1 0 0 2 0 1 0 1 0 0 0 0 0 2 0 1 2 0 2 0 1 1 1 0 0 0 1 1 2 2 2 1 0 1 3 1 2 1 1 2 1
 [921] 3 0 1 2 2 1 0 1 1 0 0 2 1 0 0 1 2 1 1 1 1 1 1 0 0 1 1 0 0 1 2 0 1 2 1 2 1 1 0 1 1 0 0 0 1 1
 [967] 2 1 1 0 0 0 2 1 1 0 0 2 1 1 0 2 2 2 0 1 2 2 0 2 3 2 2 1 0 1 1 0 0 0

> table(random_hyper)
random_hyper
  0   1   2   3   4
302 434 215  46   3
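
To see how close a simulation like this comes to the theoretical distribution, the relative frequencies can be placed next to the dhyper() probabilities. This is a sketch, not part of the original post: the seed is an arbitrary choice made here for repeatability, so the counts will differ from the table above.

```r
set.seed(2015)   # arbitrary seed so the run is repeatable (not in the original post)
draws <- rhyper(1000, m = 5, n = 20, k = 5)

# Relative frequency of each outcome 0..5 (factor() keeps outcomes with zero count)
observed <- as.numeric(table(factor(draws, levels = 0:5))) / length(draws)

# Theoretical probabilities from the density function
expected <- dhyper(0:5, m = 5, n = 20, k = 5)

print(round(data.frame(x = 0:5, observed = observed, expected = expected), 4))
```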

I hope this was helpful.

License: Attribution, NonCommercial, NoDerivatives

Posted by Rfriend

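R Binomial Distribution : dbinom()

R Analysis and Programming / R Statistical Analysis   2015. 9. 13. 21:58

You have probably met a variety of probability distributions in statistics class. I suspect that for many people, not knowing what those distributions are actually used for, on top of the 'difficult concepts', was the point where they started to drift away from statistics.

If we know which probability distribution a population follows, we can compute the probability for a given quantile, or conversely the quantile that corresponds to a given probability. For simulations, we can also use the distribution to generate random numbers.

Probability distributions fall into two broad groups: discrete probability distributions and continuous probability distributions.

Discrete probability distributions include the binomial distribution, the hypergeometric distribution, and the Poisson distribution.

Continuous probability distributions include the normal distribution, the t-distribution, the F-distribution, the uniform distribution, the chi-square distribution, and the gamma distribution.

In this post I cover the first of the discrete distributions, the binomial distribution: its basic concept, and how to use R to plot it, compute the probability for a given quantile, compute the quantile for a given probability, and generate random numbers.

Suppose an experiment is repeated and each trial has only two possible outcomes, "success (1)" or "failure (0)". Such a trial is called a Bernoulli trial.

If a Bernoulli trial with success probability p is repeated n times and X is the number of successes, the random variable X follows a binomial distribution with parameters n and p. Its probability function is f(x) = nCx * p^x * (1-p)^(n-x) for x = 0, 1, ..., n.

* Note: nCx in f(x) is the number of ways to choose x items out of n (in other words, the number of ways x successes can occur in n trials): nCx = n! / (x!(n-x)!)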
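
The binomial probability of x successes in n trials, nCx * p^x * (1-p)^(n-x), can be checked by hand against dbinom(). In this small sketch the values n = 20, p = 0.5, x = 12 are picked to match the example used later in the post, and choose() is base R's function for nCx.

```r
n <- 20; p <- 0.5; x <- 12

# f(x) = nCx * p^x * (1-p)^(n-x), with choose(n, x) giving nCx
manual  <- choose(n, x) * p^x * (1 - p)^(n - x)
builtin <- dbinom(x, size = n, prob = p)

print(c(manual = manual, builtin = builtin))
print(all.equal(manual, builtin))
```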

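For example, a Bernoulli trial repeated 20 times with replacement and a 50% probability of success or failure (say, tossing a coin for heads (H) or tails (T)) can be simulated with R's sample() function as shown below. The result changes a little on every run, but the counts of successes and failures stay roughly half and half.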

> sample(c("H", "T"), size=20, replace=TRUE, prob=c(0.5, 0.5))
 [1] "H" "T" "T" "H" "H" "T" "H" "T" "H" "H" "H" "T" "H" "T" "H" "T" "T" "H" "H" "H"
> sample(c("H", "T"), size=20, replace=TRUE, prob=c(0.5, 0.5))
 [1] "T" "H" "T" "H" "H" "H" "H" "T" "H" "H" "H" "H" "T" "T" "H" "T" "T" "T" "H" "T"
> sample(c("H", "T"), size=20, replace=TRUE, prob=c(0.5, 0.5))
 [1] "H" "H" "T" "T" "H" "H" "H" "T" "T" "T" "T" "H" "H" "T" "T" "T" "T" "H" "T" "T"

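The R functions and parameters for the density function, cumulative distribution function, quantile function, and random number generation of the binomial distribution are as follows.

Category	Prefix	Binomial distribution (binom) R function / parameters
Density function	d	dbinom(x, size, prob)
Cumulative distribution function	p	pbinom(q, size, prob, lower.tail = TRUE / FALSE)
Quantile function	q	qbinom(p, size, prob, lower.tail = TRUE / FALSE)
Random number generation	r	rbinom(n, size, prob)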

(1) Density function of the binomial distribution for 20 Bernoulli trials with success probability 0.5 : plot()

> y <- dbinom(0:20, size=20, prob=0.5)

> plot(0:20, y, type='h', lwd=5, col="grey", ylab="Probability", xlab="Random variable X", 
+      main = c("X ~ B(20, 0.5)"))

(2) Computing P(X = 12) : dbinom(x, size, prob)

> # P(X = 12)
> dbinom(12, size=20, prob=0.5)
[1] 0.1201344

(3) Computing P(X <= 12) : pbinom(q, size, prob, lower.tail=TRUE)

> # (3) P(X <= 12)
> pbinom(12, size=20, prob=0.5, lower.tail = TRUE)
[1] 0.868412
> 
> sum(dbinom(0:12, size=20, prob=0.5)) # summing dbinom() gives the same value
[1] 0.868412

(4) Computing P(X > 12) : pbinom(q, size, prob, lower.tail = FALSE)

> pbinom(12, size=20, prob=0.5, lower.tail = FALSE)
[1] 0.131588


> 1 - pbinom(12, size=20, prob=0.5, lower.tail = TRUE) # same value
[1] 0.131588
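
The function table lists qbinom(), the quantile function, but the examples here do not call it. As a hedged illustration (the variable names are made up for this sketch), qbinom() can be shown undoing pbinom() for the same B(20, 0.5) distribution:

```r
size <- 20; prob <- 0.5

p_le_12 <- pbinom(12, size = size, prob = prob)        # P(X <= 12)
q_back  <- qbinom(p_le_12, size = size, prob = prob)   # smallest x with P(X <= x) >= p

print(c(p_le_12 = p_le_12, q_back = q_back))

# Quantiles for several probabilities at once
print(qbinom(c(0.25, 0.5, 0.95), size = size, prob = prob))
```

Because the distribution is discrete, qbinom() returns the smallest integer whose cumulative probability reaches p, so many different p values map to the same quantile.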

(5) Generating binomial random numbers : rbinom(n, size, prob)

> rbinom(12, size=20, prob=0.5)
 [1] 13 10  9 14 10  9  9  7 17  9  6 10

> rbinom(12, size=20, prob=0.5)
 [1] 11 11  9 13  4 12  7 10 10 14 10 13

> rbinom(12, size=20, prob=0.5)
 [1]  7  5 11  9 13 10 11 14 10 14  4 10

(6) Plot of the binomial cumulative distribution function

> # (6) cumulative distribution function plot
> plot(0:20, pbinom(0:20, size=20, prob=0.5), type='h')

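A reader emailed me the question below, so I am sharing the question together with my answer. I wrote a loop with repeat{} that raises prob in steps of 0.001 until the condition is met, which yields an approximate value. Note that the loop calls pbinom(50, ..., lower.tail = FALSE), which is P(X > 50), that is, 51 or more successes; to match the question's "50 or more" exactly, use 49 in place of 50, as in the questioner's own expression.

Question)

Is it possible to find prob in a binomial distribution when its value is not given?

For example, I want to find the case where the probability of 50 or more successes in 200 trials is greater than 0.2.

1-pbinom(49,200,prob)>0.2

Is there a way to solve this for prob?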

> i <- 0.01
> repeat {
+   pbinom_x_50_upper_size_200 <- pbinom(50, size=200, prob=i, lower.tail = FALSE) 
+   cat("pbinom(x=50, size=200, prob=", i, ", lower.tail=FALSE) = ", pbinom_x_50_upper_size_200, "\n", sep="")
+   if (pbinom_x_50_upper_size_200 > 0.2) break
+   i <- i+0.001
+ }
pbinom(x=50, size=200, prob=0.01, lower.tail=FALSE) = 3.0749e-55
pbinom(x=50, size=200, prob=0.011, lower.tail=FALSE) = 3.425864e-53
pbinom(x=50, size=200, prob=0.012, lower.tail=FALSE) = 2.499419e-51
pbinom(x=50, size=200, prob=0.013, lower.tail=FALSE) = 1.277906e-49
pbinom(x=50, size=200, prob=0.014, lower.tail=FALSE) = 4.826703e-48
pbinom(x=50, size=200, prob=0.015, lower.tail=FALSE) = 1.40426e-46
pbinom(x=50, size=200, prob=0.016, lower.tail=FALSE) = 3.254805e-45
pbinom(x=50, size=200, prob=0.017, lower.tail=FALSE) = 6.178075e-44
pbinom(x=50, size=200, prob=0.018, lower.tail=FALSE) = 9.825351e-43
pbinom(x=50, size=200, prob=0.019, lower.tail=FALSE) = 1.334491e-41
pbinom(x=50, size=200, prob=0.02, lower.tail=FALSE) = 1.573198e-40
pbinom(x=50, size=200, prob=0.021, lower.tail=FALSE) = 1.632106e-39
pbinom(x=50, size=200, prob=0.022, lower.tail=FALSE) = 1.507896e-38
pbinom(x=50, size=200, prob=0.023, lower.tail=FALSE) = 1.253499e-37
pbinom(x=50, size=200, prob=0.024, lower.tail=FALSE) = 9.460336e-37
pbinom(x=50, size=200, prob=0.025, lower.tail=FALSE) = 6.533438e-36
pbinom(x=50, size=200, prob=0.026, lower.tail=FALSE) = 4.157663e-35
pbinom(x=50, size=200, prob=0.027, lower.tail=FALSE) = 2.453052e-34
pbinom(x=50, size=200, prob=0.028, lower.tail=FALSE) = 1.349272e-33
pbinom(x=50, size=200, prob=0.029, lower.tail=FALSE) = 6.952835e-33
pbinom(x=50, size=200, prob=0.03, lower.tail=FALSE) = 3.371404e-32
pbinom(x=50, size=200, prob=0.031, lower.tail=FALSE) = 1.544453e-31
pbinom(x=50, size=200, prob=0.032, lower.tail=FALSE) = 6.708377e-31
pbinom(x=50, size=200, prob=0.033, lower.tail=FALSE) = 2.771777e-30
pbinom(x=50, size=200, prob=0.034, lower.tail=FALSE) = 1.092671e-29
pbinom(x=50, size=200, prob=0.035, lower.tail=FALSE) = 4.120884e-29
pbinom(x=50, size=200, prob=0.036, lower.tail=FALSE) = 1.490533e-28
pbinom(x=50, size=200, prob=0.037, lower.tail=FALSE) = 5.182438e-28
pbinom(x=50, size=200, prob=0.038, lower.tail=FALSE) = 1.735721e-27
pbinom(x=50, size=200, prob=0.039, lower.tail=FALSE) = 5.610737e-27
pbinom(x=50, size=200, prob=0.04, lower.tail=FALSE) = 1.753602e-26
pbinom(x=50, size=200, prob=0.041, lower.tail=FALSE) = 5.30802e-26
pbinom(x=50, size=200, prob=0.042, lower.tail=FALSE) = 1.558447e-25
pbinom(x=50, size=200, prob=0.043, lower.tail=FALSE) = 4.444567e-25
pbinom(x=50, size=200, prob=0.044, lower.tail=FALSE) = 1.232885e-24
pbinom(x=50, size=200, prob=0.045, lower.tail=FALSE) = 3.330502e-24
pbinom(x=50, size=200, prob=0.046, lower.tail=FALSE) = 8.771904e-24
pbinom(x=50, size=200, prob=0.047, lower.tail=FALSE) = 2.254996e-23
pbinom(x=50, size=200, prob=0.048, lower.tail=FALSE) = 5.663776e-23
pbinom(x=50, size=200, prob=0.049, lower.tail=FALSE) = 1.391196e-22
pbinom(x=50, size=200, prob=0.05, lower.tail=FALSE) = 3.344884e-22
pbinom(x=50, size=200, prob=0.051, lower.tail=FALSE) = 7.878612e-22
pbinom(x=50, size=200, prob=0.052, lower.tail=FALSE) = 1.819442e-21
pbinom(x=50, size=200, prob=0.053, lower.tail=FALSE) = 4.122599e-21
pbinom(x=50, size=200, prob=0.054, lower.tail=FALSE) = 9.171801e-21
pbinom(x=50, size=200, prob=0.055, lower.tail=FALSE) = 2.004832e-20
pbinom(x=50, size=200, prob=0.056, lower.tail=FALSE) = 4.308392e-20
pbinom(x=50, size=200, prob=0.057, lower.tail=FALSE) = 9.108046e-20
pbinom(x=50, size=200, prob=0.058, lower.tail=FALSE) = 1.895194e-19
pbinom(x=50, size=200, prob=0.059, lower.tail=FALSE) = 3.883594e-19
pbinom(x=50, size=200, prob=0.06, lower.tail=FALSE) = 7.841276e-19
pbinom(x=50, size=200, prob=0.061, lower.tail=FALSE) = 1.560713e-18
pbinom(x=50, size=200, prob=0.062, lower.tail=FALSE) = 3.063672e-18
pbinom(x=50, size=200, prob=0.063, lower.tail=FALSE) = 5.933835e-18
pbinom(x=50, size=200, prob=0.064, lower.tail=FALSE) = 1.134446e-17
pbinom(x=50, size=200, prob=0.065, lower.tail=FALSE) = 2.141708e-17
pbinom(x=50, size=200, prob=0.066, lower.tail=FALSE) = 3.994201e-17
pbinom(x=50, size=200, prob=0.067, lower.tail=FALSE) = 7.361232e-17
pbinom(x=50, size=200, prob=0.068, lower.tail=FALSE) = 1.341135e-16
pbinom(x=50, size=200, prob=0.069, lower.tail=FALSE) = 2.416241e-16
pbinom(x=50, size=200, prob=0.07, lower.tail=FALSE) = 4.306171e-16
pbinom(x=50, size=200, prob=0.071, lower.tail=FALSE) = 7.593772e-16
pbinom(x=50, size=200, prob=0.072, lower.tail=FALSE) = 1.325457e-15
pbinom(x=50, size=200, prob=0.073, lower.tail=FALSE) = 2.290532e-15
pbinom(x=50, size=200, prob=0.074, lower.tail=FALSE) = 3.920003e-15
pbinom(x=50, size=200, prob=0.075, lower.tail=FALSE) = 6.645489e-15
pbinom(x=50, size=200, prob=0.076, lower.tail=FALSE) = 1.11626e-14
pbinom(x=50, size=200, prob=0.077, lower.tail=FALSE) = 1.858251e-14
pbinom(x=50, size=200, prob=0.078, lower.tail=FALSE) = 3.066499e-14
pbinom(x=50, size=200, prob=0.079, lower.tail=FALSE) = 5.01737e-14
pbinom(x=50, size=200, prob=0.08, lower.tail=FALSE) = 8.141335e-14
pbinom(x=50, size=200, prob=0.081, lower.tail=FALSE) = 1.310357e-13
pbinom(x=50, size=200, prob=0.082, lower.tail=FALSE) = 2.092391e-13
pbinom(x=50, size=200, prob=0.083, lower.tail=FALSE) = 3.315409e-13
pbinom(x=50, size=200, prob=0.084, lower.tail=FALSE) = 5.213765e-13
pbinom(x=50, size=200, prob=0.085, lower.tail=FALSE) = 8.138827e-13
pbinom(x=50, size=200, prob=0.086, lower.tail=FALSE) = 1.261367e-12
pbinom(x=50, size=200, prob=0.087, lower.tail=FALSE) = 1.941161e-12
pbinom(x=50, size=200, prob=0.088, lower.tail=FALSE) = 2.966815e-12
pbinom(x=50, size=200, prob=0.089, lower.tail=FALSE) = 4.503954e-12
pbinom(x=50, size=200, prob=0.09, lower.tail=FALSE) = 6.792603e-12
pbinom(x=50, size=200, prob=0.091, lower.tail=FALSE) = 1.017839e-11
pbinom(x=50, size=200, prob=0.092, lower.tail=FALSE) = 1.51559e-11
pbinom(x=50, size=200, prob=0.093, lower.tail=FALSE) = 2.242863e-11
pbinom(x=50, size=200, prob=0.094, lower.tail=FALSE) = 3.299116e-11
pbinom(x=50, size=200, prob=0.095, lower.tail=FALSE) = 4.824148e-11
pbinom(x=50, size=200, prob=0.096, lower.tail=FALSE) = 7.01333e-11
pbinom(x=50, size=200, prob=0.097, lower.tail=FALSE) = 1.013817e-10
pbinom(x=50, size=200, prob=0.098, lower.tail=FALSE) = 1.457388e-10
pbinom(x=50, size=200, prob=0.099, lower.tail=FALSE) = 2.083625e-10
pbinom(x=50, size=200, prob=0.1, lower.tail=FALSE) = 2.963049e-10
pbinom(x=50, size=200, prob=0.101, lower.tail=FALSE) = 4.191581e-10
pbinom(x=50, size=200, prob=0.102, lower.tail=FALSE) = 5.899028e-10
pbinom(x=50, size=200, prob=0.103, lower.tail=FALSE) = 8.260166e-10
pbinom(x=50, size=200, prob=0.104, lower.tail=FALSE) = 1.150917e-09
pbinom(x=50, size=200, prob=0.105, lower.tail=FALSE) = 1.595829e-09
pbinom(x=50, size=200, prob=0.106, lower.tail=FALSE) = 2.202187e-09
pbinom(x=50, size=200, prob=0.107, lower.tail=FALSE) = 3.024722e-09
pbinom(x=50, size=200, prob=0.108, lower.tail=FALSE) = 4.135395e-09
pbinom(x=50, size=200, prob=0.109, lower.tail=FALSE) = 5.628391e-09
pbinom(x=50, size=200, prob=0.11, lower.tail=FALSE) = 7.626442e-09
pbinom(x=50, size=200, prob=0.111, lower.tail=FALSE) = 1.028878e-08
pbinom(x=50, size=200, prob=0.112, lower.tail=FALSE) = 1.382111e-08
pbinom(x=50, size=200, prob=0.113, lower.tail=FALSE) = 1.848804e-08
pbinom(x=50, size=200, prob=0.114, lower.tail=FALSE) = 2.462857e-08
pbinom(x=50, size=200, prob=0.115, lower.tail=FALSE) = 3.267517e-08
pbinom(x=50, size=200, prob=0.116, lower.tail=FALSE) = 4.317741e-08
pbinom(x=50, size=200, prob=0.117, lower.tail=FALSE) = 5.683084e-08
pbinom(x=50, size=200, prob=0.118, lower.tail=FALSE) = 7.451237e-08
pbinom(x=50, size=200, prob=0.119, lower.tail=FALSE) = 9.732331e-08
pbinom(x=50, size=200, prob=0.12, lower.tail=FALSE) = 1.266415e-07
pbinom(x=50, size=200, prob=0.121, lower.tail=FALSE) = 1.641845e-07
pbinom(x=50, size=200, prob=0.122, lower.tail=FALSE) = 2.120853e-07
pbinom(x=50, size=200, prob=0.123, lower.tail=FALSE) = 2.729831e-07
pbinom(x=50, size=200, prob=0.124, lower.tail=FALSE) = 3.501321e-07
pbinom(x=50, size=200, prob=0.125, lower.tail=FALSE) = 4.475306e-07
pbinom(x=50, size=200, prob=0.126, lower.tail=FALSE) = 5.700739e-07
pbinom(x=50, size=200, prob=0.127, lower.tail=FALSE) = 7.237351e-07
pbinom(x=50, size=200, prob=0.128, lower.tail=FALSE) = 9.157779e-07
pbinom(x=50, size=200, prob=0.129, lower.tail=FALSE) = 1.155006e-06
pbinom(x=50, size=200, prob=0.13, lower.tail=FALSE) = 1.452053e-06
pbinom(x=50, size=200, prob=0.131, lower.tail=FALSE) = 1.819725e-06
pbinom(x=50, size=200, prob=0.132, lower.tail=FALSE) = 2.273389e-06
pbinom(x=50, size=200, prob=0.133, lower.tail=FALSE) = 2.831432e-06
pbinom(x=50, size=200, prob=0.134, lower.tail=FALSE) = 3.515784e-06
pbinom(x=50, size=200, prob=0.135, lower.tail=FALSE) = 4.352516e-06
pbinom(x=50, size=200, prob=0.136, lower.tail=FALSE) = 5.372534e-06
pbinom(x=50, size=200, prob=0.137, lower.tail=FALSE) = 6.612358e-06
pbinom(x=50, size=200, prob=0.138, lower.tail=FALSE) = 8.115017e-06
pbinom(x=50, size=200, prob=0.139, lower.tail=FALSE) = 9.93106e-06
pbinom(x=50, size=200, prob=0.14, lower.tail=FALSE) = 1.211969e-05
pbinom(x=50, size=200, prob=0.141, lower.tail=FALSE) = 1.475007e-05
pbinom(x=50, size=200, prob=0.142, lower.tail=FALSE) = 1.79027e-05
pbinom(x=50, size=200, prob=0.143, lower.tail=FALSE) = 2.167112e-05
pbinom(x=50, size=200, prob=0.144, lower.tail=FALSE) = 2.616362e-05
pbinom(x=50, size=200, prob=0.145, lower.tail=FALSE) = 3.150524e-05
pbinom(x=50, size=200, prob=0.146, lower.tail=FALSE) = 3.784002e-05
pbinom(x=50, size=200, prob=0.147, lower.tail=FALSE) = 4.533335e-05
pbinom(x=50, size=200, prob=0.148, lower.tail=FALSE) = 5.417471e-05
pbinom(x=50, size=200, prob=0.149, lower.tail=FALSE) = 6.458051e-05
pbinom(x=50, size=200, prob=0.15, lower.tail=FALSE) = 7.679736e-05
pbinom(x=50, size=200, prob=0.151, lower.tail=FALSE) = 9.110546e-05
pbinom(x=50, size=200, prob=0.152, lower.tail=FALSE) = 0.0001078224
pbinom(x=50, size=200, prob=0.153, lower.tail=FALSE) = 0.0001273072
pbinom(x=50, size=200, prob=0.154, lower.tail=FALSE) = 0.0001499647
pbinom(x=50, size=200, prob=0.155, lower.tail=FALSE) = 0.0001762501
pbinom(x=50, size=200, prob=0.156, lower.tail=FALSE) = 0.0002066743
pbinom(x=50, size=200, prob=0.157, lower.tail=FALSE) = 0.0002418089
pbinom(x=50, size=200, prob=0.158, lower.tail=FALSE) = 0.000282292
pbinom(x=50, size=200, prob=0.159, lower.tail=FALSE) = 0.0003288341
pbinom(x=50, size=200, prob=0.16, lower.tail=FALSE) = 0.0003822243
pbinom(x=50, size=200, prob=0.161, lower.tail=FALSE) = 0.0004433374
pbinom(x=50, size=200, prob=0.162, lower.tail=FALSE) = 0.0005131399
pbinom(x=50, size=200, prob=0.163, lower.tail=FALSE) = 0.0005926979
pbinom(x=50, size=200, prob=0.164, lower.tail=FALSE) = 0.0006831842
pbinom(x=50, size=200, prob=0.165, lower.tail=FALSE) = 0.0007858856
pbinom(x=50, size=200, prob=0.166, lower.tail=FALSE) = 0.0009022113
pbinom(x=50, size=200, prob=0.167, lower.tail=FALSE) = 0.0010337
pbinom(x=50, size=200, prob=0.168, lower.tail=FALSE) = 0.001182029
pbinom(x=50, size=200, prob=0.169, lower.tail=FALSE) = 0.001349022
pbinom(x=50, size=200, prob=0.17, lower.tail=FALSE) = 0.001536655
pbinom(x=50, size=200, prob=0.171, lower.tail=FALSE) = 0.001747067
pbinom(x=50, size=200, prob=0.172, lower.tail=FALSE) = 0.001982569
pbinom(x=50, size=200, prob=0.173, lower.tail=FALSE) = 0.002245645
pbinom(x=50, size=200, prob=0.174, lower.tail=FALSE) = 0.00253897
pbinom(x=50, size=200, prob=0.175, lower.tail=FALSE) = 0.002865405
pbinom(x=50, size=200, prob=0.176, lower.tail=FALSE) = 0.003228015
pbinom(x=50, size=200, prob=0.177, lower.tail=FALSE) = 0.003630065
pbinom(x=50, size=200, prob=0.178, lower.tail=FALSE) = 0.004075034
pbinom(x=50, size=200, prob=0.179, lower.tail=FALSE) = 0.004566614
pbinom(x=50, size=200, prob=0.18, lower.tail=FALSE) = 0.005108717
pbinom(x=50, size=200, prob=0.181, lower.tail=FALSE) = 0.005705476
pbinom(x=50, size=200, prob=0.182, lower.tail=FALSE) = 0.006361248
pbinom(x=50, size=200, prob=0.183, lower.tail=FALSE) = 0.007080619
pbinom(x=50, size=200, prob=0.184, lower.tail=FALSE) = 0.007868395
pbinom(x=50, size=200, prob=0.185, lower.tail=FALSE) = 0.00872961
pbinom(x=50, size=200, prob=0.186, lower.tail=FALSE) = 0.009669517
pbinom(x=50, size=200, prob=0.187, lower.tail=FALSE) = 0.01069359
pbinom(x=50, size=200, prob=0.188, lower.tail=FALSE) = 0.0118075
pbinom(x=50, size=200, prob=0.189, lower.tail=FALSE) = 0.01301716
pbinom(x=50, size=200, prob=0.19, lower.tail=FALSE) = 0.01432863
pbinom(x=50, size=200, prob=0.191, lower.tail=FALSE) = 0.01574819
pbinom(x=50, size=200, prob=0.192, lower.tail=FALSE) = 0.01728228
pbinom(x=50, size=200, prob=0.193, lower.tail=FALSE) = 0.01893752
pbinom(x=50, size=200, prob=0.194, lower.tail=FALSE) = 0.02072064
pbinom(x=50, size=200, prob=0.195, lower.tail=FALSE) = 0.02263853
pbinom(x=50, size=200, prob=0.196, lower.tail=FALSE) = 0.02469818
pbinom(x=50, size=200, prob=0.197, lower.tail=FALSE) = 0.02690666
pbinom(x=50, size=200, prob=0.198, lower.tail=FALSE) = 0.02927113
pbinom(x=50, size=200, prob=0.199, lower.tail=FALSE) = 0.03179876
pbinom(x=50, size=200, prob=0.2, lower.tail=FALSE) = 0.03449677
pbinom(x=50, size=200, prob=0.201, lower.tail=FALSE) = 0.03737237
pbinom(x=50, size=200, prob=0.202, lower.tail=FALSE) = 0.0404327
pbinom(x=50, size=200, prob=0.203, lower.tail=FALSE) = 0.04368489
pbinom(x=50, size=200, prob=0.204, lower.tail=FALSE) = 0.04713593
pbinom(x=50, size=200, prob=0.205, lower.tail=FALSE) = 0.0507927
pbinom(x=50, size=200, prob=0.206, lower.tail=FALSE) = 0.05466193
pbinom(x=50, size=200, prob=0.207, lower.tail=FALSE) = 0.05875016
pbinom(x=50, size=200, prob=0.208, lower.tail=FALSE) = 0.06306369
pbinom(x=50, size=200, prob=0.209, lower.tail=FALSE) = 0.0676086
pbinom(x=50, size=200, prob=0.21, lower.tail=FALSE) = 0.07239063
pbinom(x=50, size=200, prob=0.211, lower.tail=FALSE) = 0.07741525
pbinom(x=50, size=200, prob=0.212, lower.tail=FALSE) = 0.08268754
pbinom(x=50, size=200, prob=0.213, lower.tail=FALSE) = 0.08821221
pbinom(x=50, size=200, prob=0.214, lower.tail=FALSE) = 0.09399354
pbinom(x=50, size=200, prob=0.215, lower.tail=FALSE) = 0.1000354
pbinom(x=50, size=200, prob=0.216, lower.tail=FALSE) = 0.1063411
pbinom(x=50, size=200, prob=0.217, lower.tail=FALSE) = 0.1129135
pbinom(x=50, size=200, prob=0.218, lower.tail=FALSE) = 0.1197549
pbinom(x=50, size=200, prob=0.219, lower.tail=FALSE) = 0.1268671
pbinom(x=50, size=200, prob=0.22, lower.tail=FALSE) = 0.1342513
pbinom(x=50, size=200, prob=0.221, lower.tail=FALSE) = 0.141908
pbinom(x=50, size=200, prob=0.222, lower.tail=FALSE) = 0.1498372
pbinom(x=50, size=200, prob=0.223, lower.tail=FALSE) = 0.1580383
pbinom(x=50, size=200, prob=0.224, lower.tail=FALSE) = 0.1665098
pbinom(x=50, size=200, prob=0.225, lower.tail=FALSE) = 0.1752498
pbinom(x=50, size=200, prob=0.226, lower.tail=FALSE) = 0.1842556
pbinom(x=50, size=200, prob=0.227, lower.tail=FALSE) = 0.193524
pbinom(x=50, size=200, prob=0.228, lower.tail=FALSE) = 0.2030508
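
The same threshold can also be located without stepping through a grid. The sketch below is an alternative to the loop, not the method used in the post: it uses base R's uniroot() to solve P(X >= 50) = 0.2 for prob, taking the questioner's own expression with 49. The names target_gap and prob_threshold are invented here, and the approach relies on the upper-tail probability increasing as prob increases.

```r
# Upper-tail probability P(X >= 50) for X ~ B(200, prob), minus the target 0.2
target_gap <- function(prob) {
  pbinom(49, size = 200, prob = prob, lower.tail = FALSE) - 0.2
}

# The gap is negative for a tiny prob and positive for a large prob,
# so a root lies inside the interval
solution <- uniroot(target_gap, interval = c(0.01, 0.99), tol = 1e-9)

prob_threshold <- solution$root
print(prob_threshold)
print(pbinom(49, size = 200, prob = prob_threshold, lower.tail = FALSE))
```

Any prob above prob_threshold satisfies the inequality in the question; the loop gives the same kind of answer, only rounded up to its step size of 0.001.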

Posted by Rfriend

R Interactive Plotting : with the manipulate package in RStudio

R Analysis and Programming / R Graphs_Visualization   2015. 9. 12. 12:51

From the many earlier posts you have probably gotten a good feel for how varied R's graphing and visualization features are. But that is not all.

In this post I show how to install the manipulate package in RStudio and manipulate plots interactively. (For reference, R has several other packages that support interactive plots and pivot tables, such as shiny, plotly, and rpivotTable.)

With the plotting methods covered so far, a plot is drawn once with fixed settings and that is the end of it. To change an option or the object being plotted, you had to go back to the script window and edit the program.

With interactive plotting inside RStudio using the manipulate package, I think you will find it both convenient and fun. The code needed to build an interactive plot with manipulate is not difficult either.

I will go through the following three control types in order, followed by a fourth example that combines them.

(1) Slider Control
(2) Picker Control
(3) Checkbox Control
(4) Combining Controls

The examples use the Cars93 data frame built into the MASS package, specifically the variables Type (vehicle type), Price, MPG.highway (highway fuel economy), and Weight.

> library(MASS)
> str(Cars93)
'data.frame':	93 obs. of  27 variables:
 $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
 $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
 $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
 $ Min.Price         : num  12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
 $ Price             : num  15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
 $ Max.Price         : num  18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
 $ MPG.city          : int  25 18 20 19 22 22 19 16 19 16 ...
 $ MPG.highway       : int  31 25 26 26 30 31 28 25 27 25 ...
 $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
 $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
 $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
 $ EngineSize        : num  1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
 $ Horsepower        : int  140 200 172 172 208 110 170 180 170 200 ...
 $ RPM               : int  6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
 $ Rev.per.mile      : int  2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
 $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
 $ Fuel.tank.capacity: num  13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
 $ Passengers        : int  5 5 5 6 4 6 6 6 5 6 ...
 $ Length            : int  177 195 180 193 186 189 200 216 198 206 ...
 $ Wheelbase         : int  102 115 102 106 109 105 111 116 108 114 ...
 $ Width             : int  68 71 67 70 69 69 74 78 73 73 ...
 $ Turn.circle       : int  37 38 37 37 39 41 42 45 41 43 ...
 $ Rear.seat.room    : num  26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
 $ Luggage.room      : int  11 15 14 17 13 16 17 21 14 18 ...
 $ Weight            : int  2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
 $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
 $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...

(0) manipulate package installation

First, install the manipulate package separately and load it.

> install.packages("manipulate")
Installing package into ‘C:/Users/user/Documents/R/win-library/3.2’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.2/manipulate_1.0.1.zip'
Content type 'application/zip' length 35812 bytes (34 KB)
downloaded 34 KB

package ‘manipulate’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\user\AppData\Local\Temp\RtmpGaEGZu\downloaded_packages

> library(manipulate)

(1) Slider Control

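Define the slider options in the form slider(min, max, step = , initial = ), then assign that control to the part of the plot you want to manipulate. That sounds abstract, so here is an example: in the histogram below, the breaks argument (the number of bins, which hist() treats as a suggestion only) can be moved over the range 3 to 100 in steps of 5, with a default of 20 when the plot is first drawn. To make the behavior easy to grasp, the original post includes a screenshot and a screen recording. Each time you drag the slider, the histogram in the Plots pane is redrawn; try it and watch how it changes.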

> library(MASS)

> ## Slider Control
> 
> manipulate(
+   hist(Cars93$Price, breaks = bin_slider),  
+   
+   bin_slider=slider(3,100, step=5, initial = 20))
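
Outside RStudio the slider itself is not available, but what it changes can be inspected statically. The following sketch (not from the original post; the candidate break values and the name bins_used are chosen here for illustration) calls hist() with plot = FALSE for a few breaks values and reports how many bins were actually used, which shows that breaks is only a suggestion.

```r
library(MASS)   # provides the Cars93 data frame

candidate_breaks <- c(3, 20, 50, 100)   # a few values inside the slider's range

# Compute each histogram without drawing it and count the resulting bins
bins_used <- sapply(candidate_breaks, function(b) {
  h <- hist(Cars93$Price, breaks = b, plot = FALSE)
  length(h$counts)
})

print(data.frame(breaks_requested = candidate_breaks, bins_used = bins_used))
```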

(2) Picker Control

This Picker Control example lets you choose which variable the histogram shows.

> ## Picker Control
> 
> manipulate(
+   hist(Cars93[, continuous_variable], 
+           freq = FALSE, main = continuous_variable),
+   
+   continuous_variable = picker("MPG.highway", "Weight", "Price"))

In the next example the scatter plot's x-axis, Weight, and y-axis, MPG.highway, are fixed, and Type becomes a Picker Control so that you can switch between vehicle types and see the scatter plot for each one right away.

> manipulate(
+   plot(MPG.highway ~ Weight, data=Cars93[Cars93$Type == Type,]),
+   
+   Type = picker("Compact", "Large", "Midsize", "Small", "Sporty", "Van"))

(3) Checkbox Control

In the example below, a checkbox chooses whether the box plot shows or hides the outliers identified by the 1.5 x IQR (interquartile range) rule.

> ## Checkbox Control
> 
> manipulate(
+   boxplot(Price ~ Type, data = Cars93, outline = outline),
+   
+   outline = checkbox(FALSE, "Show outliers"))

(4) Combining Controls

This example uses a Picker Control and a Slider Control together. The Picker Control selects the variable, and the Slider Control adjusts the number of histogram bins.

> ## Combining Controls
> 
> manipulate(
+   hist(Cars93[, continuous_variable], 
+        breaks = bin_slider, 
+        freq = FALSE, main = continuous_variable),
+   
+   continuous_variable = picker("MPG.highway", "Weight", "Price"),
+   bin_slider = slider(5,50, step=5, initial = 10)
+ )

Posted by Rfriend

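R User Defined Function

R Analysis and Programming / R Programming   2015. 9. 11. 00:29

While doing analysis in R, you will often find yourself repeating the same work, whether it is data preprocessing, plotting, or developing and updating a model.

When you want to (semi-)automate such repetitive work while changing the target dataset, the variables, or the conditions a little each time, a user defined function is the tool to reach for.

Without user defined functions, you end up lining up long, complicated, cluttered programs that are identical except for one small part. With a user defined function, once it is defined you only pass the part that changes as an argument, in a single line, and run it. Whenever you spot repetitive work, it is worth checking whether a user defined function could cut down the manual grind, shorten the analysis time, and keep your code short and clean.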

A user defined function has the following form:

function_name <- function( arg1, arg2, ... ) {

expression

return( object)

}

Here are a few examples.

1) User defined function for the mean, standard deviation, min, and max
(User defined function of statistics for continuous variable)

> # compute the mean, standard deviation, min, max
> 
> stat_function <- function(x) {
+   x_mean = mean(x)
+   x_sd = sd(x)
+   x_min = min(x)
+   x_max = max(x)
+   x_summary = list(x_mean=x_mean, x_sd=x_sd, x_min=x_min, x_max=x_max)
+   return(x_summary)
+ }
> 
> stat_function(x = Cars93$MPG.highway)
$x_mean
[1] 29.08602

$x_sd
[1] 5.331726

$x_min
[1] 20

$x_max
[1] 50

> # compare with summary()
> summary(Cars93$MPG.highway)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20.00   26.00   28.00   29.09   31.00   50.00
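
One benefit of wrapping the statistics in a function is that it can be applied to several variables in a single call. The sketch below is a variation written for this purpose, not the post's own function: summary_stats returns a named numeric vector instead of a list so that sapply() can stack the results into a matrix, and the three column names are just examples from Cars93.

```r
library(MASS)   # provides the Cars93 data frame

# Same four statistics, returned as a named numeric vector
summary_stats <- function(x) {
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}

target_vars <- c("Price", "MPG.highway", "Weight")

# One column of results per variable
stats_table <- sapply(Cars93[target_vars], summary_stats)
print(round(stats_table, 2))
```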

2) User defined function for drawing a scatter plot (User defined function of scatter plot)

> # scatter plot function
> plot_function <- function(dataset, x, y, title) {
+   attach(dataset)
+   plot(y ~ x, dataset, type="p", 
+        main = title)
+   detach(dataset)
+ }

> plot_function(dataset=Cars93, x=MPG.highway, y=Weight, title="Scatter Plot of MPG.highway & Weight")


> plot_function(dataset=Cars93, x=Price, y=Horsepower, title="Scatter Plot of Price & Horsepower")

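The basic statistics above are available from summary(), and the scatter plot looks no different from calling plot() directly, so it may not be obvious what makes user defined functions attractive. But suppose you want to write the statistics out to a txt file, loop over different x variables computing the statistics each time, and keep appending each result to the file written so far. In that situation a user defined function like the one above is the natural choice.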
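
A loop that appends statistics to a text file could look roughly like this. It is only a sketch under stated assumptions: append_stats is a hypothetical helper written for this illustration, the output goes to a temporary file, and the three variables are arbitrary picks from Cars93.

```r
library(MASS)   # provides the Cars93 data frame

# Hypothetical helper: compute basic statistics for one column
# and append them as one line to a text file
append_stats <- function(dataset, var_name, file) {
  x <- dataset[[var_name]]
  line <- sprintf("%s: mean=%.3f sd=%.3f min=%.3f max=%.3f",
                  var_name, mean(x), sd(x), min(x), max(x))
  cat(line, "\n", sep = "", file = file, append = TRUE)
  invisible(line)
}

out_file <- tempfile(fileext = ".txt")   # temporary path chosen for this sketch
for (v in c("Price", "MPG.highway", "Weight")) {
  append_stats(Cars93, v, out_file)
}

print(readLines(out_file))
```

With append = TRUE each call adds a line to the end of the file instead of overwriting it, so the file grows by one line per variable.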

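For plots, too, when you want to draw graphs while changing part of a variable name, combining paste() with a user defined function makes it considerably more useful. The possible applications are nearly endless; this post only sketches the basic skeleton of a user defined function.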
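
As one possible reading of the paste() idea, the sketch below builds column names from the part that changes and passes them to a plotting function as strings. scatter_by_name is a hypothetical variant written for this illustration (it avoids attach() by indexing the data frame with [[ ]]), and paste0() is used as the no-separator form of paste().

```r
library(MASS)   # provides the Cars93 data frame

# Hypothetical variant of the scatter plot function that takes column names as strings
scatter_by_name <- function(dataset, x_name, y_name) {
  plot(dataset[[x_name]], dataset[[y_name]],
       xlab = x_name, ylab = y_name,
       main = paste("Scatter Plot of", x_name, "&", y_name))
}

# Build "MPG.city" and "MPG.highway" from the part of the name that changes
mpg_vars <- paste0("MPG.", c("city", "highway"))

for (v in mpg_vars) {
  scatter_by_name(Cars93, x_name = v, y_name = "Weight")
}
```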

As a side note, if you put the ellipsis (...) in the definition, as in function(x, y, ...) below, callers can later pass additional options that the function did not define explicitly, which makes the function more flexible.

function_name <- function(x, y, ...) {

expression

return(object)

}
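
A small example may make the role of the ellipsis concrete. In this sketch flexible_mean is a made-up name; it simply forwards whatever extra arguments it receives to base R's mean().

```r
# '...' forwards any extra arguments to mean() unchanged
flexible_mean <- function(x, ...) {
  result <- mean(x, ...)
  return(result)
}

values <- c(1, 2, 3, NA, 100)

print(flexible_mean(values))                 # NA, because of the missing value
print(flexible_mean(values, na.rm = TRUE))   # 26.5
print(flexible_mean(values, na.rm = TRUE, trim = 0.25))
```

Neither na.rm nor trim appears in the definition of flexible_mean, yet both reach mean() through the ellipsis.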

Posted by Rfriend

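R ggplot2 continuous probability distribution curves, stat_function()

R Analysis and Programming / R Graphs_Visualization   2015. 9. 7. 23:58

Probability is one of the basic concepts you cannot leave out of statistics. We use probability when drawing a sample from a population, as in random sampling and stratified random sampling. Estimation and hypothesis testing rely on probability distributions. In regression analysis, discriminant analysis, and the like, we test whether variables follow a normal distribution. In simulation, we generate random numbers according to the population's probability distribution.

Anyone who has studied a bit of statistics will know the normal distribution. For the other distributions, though, many people have heard the names without really knowing what they look like or when to use them.

Drawing continuous probability distribution curves with R's ggplot2 helps you understand the shape of each distribution, and lets you check how the shape changes with the parameters.

This post mainly covers the density functions, whose names start with 'd', and shows how to draw them with ggplot2 for the normal (norm), t (t), chi-square (chisq), exponential (exp), F (f), gamma (gamma), and uniform (unif) distributions.

[ Summary table of R functions by continuous distribution and function type ]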

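Distribution	Root name	Density function d	Cumulative distribution function p	Quantile function q	Random number generation r
Normal distribution	norm	dnorm()	pnorm()	qnorm()	rnorm()
t-distribution	t	dt()	pt()	qt()	rt()
Chi-square distribution	chisq	dchisq()	pchisq()	qchisq()	rchisq()
Exponential distribution	exp	dexp()	pexp()	qexp()	rexp()
F-distribution	f	df()	pf()	qf()	rf()
Gamma distribution	gamma	dgamma()	pgamma()	qgamma()	rgamma()
Uniform distribution	unif	dunif()	punif()	qunif()	runif()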
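
The d/p/q/r naming pattern in the table can be tried out directly in base R. The sketch below calls one function from each family for the standard normal distribution and then repeats the p/q pair for the exponential distribution; the seed and the variable names are arbitrary choices made for this example.

```r
# One call from each family, using the standard normal distribution
density_at_0 <- dnorm(0)        # height of the density curve at x = 0
prob_below   <- pnorm(1.96)     # P(X <= 1.96)
quantile_975 <- qnorm(0.975)    # value with 97.5% of the probability below it

set.seed(1)                     # arbitrary seed so the draws are repeatable
random_draws <- rnorm(3)        # three random values from N(0, 1)

print(c(density_at_0 = density_at_0, prob_below = prob_below, quantile_975 = quantile_975))
print(random_draws)

# The other rows follow the same pattern, e.g. the exponential distribution
exp_prob <- pexp(1, rate = 2)           # P(X <= 1) for rate 2
exp_back <- qexp(exp_prob, rate = 2)    # quantile function brings back 1
print(c(exp_prob = exp_prob, exp_back = exp_back))
```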

ggplot2 is a package that has to be installed and loaded separately, so run the steps below first.

> install.packages("ggplot2")
Installing package into ‘C:/Users/user/Documents/R/win-library/3.2’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.2/ggplot2_1.0.1.zip'
Content type 'application/zip' length 2676292 bytes (2.6 MB)
downloaded 2.6 MB

package ‘ggplot2’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\user\AppData\Local\Temp\RtmpGAPkIo\downloaded_packages
> library(ggplot2)

(1) Normal Distribution Probability Density Curve

: stat_function(fun = dnorm)

> # normal distribution : fun = dnorm
> ggplot(data.frame(x=c(-3,3)), aes(x=x)) +
+   stat_function(fun=dnorm, colour="blue", size=1) +
+   ggtitle("Normal Distribution")

(2) Shading a specific range of the normal distribution

> # shade a specific range under the curve
> dnorm_range <- function(x) {
+   y <- dnorm(x) 
+   y[x < -1 | x > 2] <- NA  # no fill outside the range -1 to 2
+   return(y)
+ }
> 
> ggplot(data.frame(x=c(-3,3)), aes(x=x)) +
+   stat_function(fun=dnorm, colour="blue", size=1) +
+   stat_function(fun=dnorm_range, geom="area", fill="grey", alpha=0.5) + 
+   ggtitle("Normal Distribution of x~N(0,1) with colour from -1 to 2")
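
The shaded region corresponds to the probability P(-1 <= X <= 2) for X ~ N(0,1). As a rough check that is not part of the original post, that area can be computed two ways: from the cumulative distribution function, and by numerically integrating the density.

```r
# Area under the standard normal density between -1 and 2.
area_exact <- pnorm(2) - pnorm(-1)                               # via the CDF
area_numeric <- integrate(dnorm, lower = -1, upper = 2)$value    # via numerical integration

area_exact     # about 0.8186
area_numeric   # agrees with area_exact up to integration error
```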

(3) Cumulative Normal Distribution : stat_function(fun = pnorm)

> # Cumulative normal distribution : fun = pnorm
> ggplot(data.frame(x=c(-3,3)), aes(x=x)) +
+   stat_function(fun=pnorm, colour="black", size=1.5) +
+   ggtitle("Cumulative Normal Distribution of x~N(0,1)")

(4) Normal distribution with a specified mean and standard deviation : stat_function(fun = dnorm, args=list(mean=2, sd=1))

> # Normal distribution: specify the mean and standard deviation
> ggplot(data.frame(x = c(-5, 5)), aes(x=x)) +
+   stat_function(fun=dnorm, args=list(mean=2, sd=1), colour="black", size=1.5) +
+   geom_vline(xintercept=2, colour="grey", linetype="dashed", size=1) + # add a vertical line at the mean
+   geom_text(x=0, y=0.3, label="x ~ N(2, 1)") +
+   ggtitle("Normal Distribution of x~N(2,1)")
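
One point that is easy to get wrong here: dnorm() takes the standard deviation (sd), not the variance. With sd=1 the two coincide, but in general they do not. The sketch below uses a hypothetical variance of 4, which is not in the original post, to show the difference.

```r
# dnorm() is parameterised by the standard deviation, not the variance.
mu <- 2
variance <- 4               # hypothetical value chosen for illustration
sigma <- sqrt(variance)     # the standard deviation is 2

# Density at the mean equals 1 / (sd * sqrt(2 * pi)).
peak_correct <- dnorm(mu, mean = mu, sd = sigma)      # uses sd = 2
peak_wrong   <- dnorm(mu, mean = mu, sd = variance)   # mistakenly uses sd = 4

c(correct = peak_correct, wrong = peak_wrong)
```

Passing the variance as sd produces a curve that is flatter and wider than intended.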

(5) t-Distribution : stat_function(fun = dt)

> # t-distribution : fun = dt
> ggplot(data.frame(x=c(-3,3)), aes(x=x)) +
+   stat_function(fun=dt, args=list(df=2), colour="red", size=2) +
+   ggtitle("t-Distribution of df=2")
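
The t-distribution with df=2 has heavier tails and a lower peak than the standard normal, and it moves toward the normal curve as the degrees of freedom grow. A small sketch of that comparison follows; the grid and the df values 10 and 100 are assumptions for illustration.

```r
# Largest absolute gap between the t density and the standard normal density
# on the plotted range, for increasing degrees of freedom.
x <- seq(-3, 3, by = 0.01)
df_values <- c(2, 10, 100)

max_gap <- sapply(df_values, function(k) max(abs(dt(x, df = k) - dnorm(x))))
names(max_gap) <- paste0("df=", df_values)

max_gap   # the gap shrinks as df increases
```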

(6) Chi-Square Distribution Probability Density Curve
: stat_function(fun = dchisq)

> # Chi-square distribution : fun = dchisq
> ggplot(data.frame(x=c(0,10)), aes(x=x)) +
+   stat_function(fun=dchisq, args=list(df=1), colour="black", size=1.2) +
+   geom_text(x=0.6, y=1, label="df=1") +
+   
+   stat_function(fun=dchisq, args=list(df=2), colour="blue", size=1.2) +
+   geom_text(x=0, y=0.55, label="df=2") +
+   
+   stat_function(fun=dchisq, args=list(df=3), colour="red", size=1.2) +
+   geom_text(x=0.5, y=0.05, label="df=3") +
+   
+   ggtitle("Chisq-Distribution")
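
The example above positions each "df=..." label by hand. One possible alternative, sketched here as an assumption rather than the post's method, is to evaluate the densities into a long data frame and map colour to df, so that ggplot2 builds a legend. The names x_grid and chisq_long are hypothetical.

```r
# Evaluate the chi-square density on a grid for several degrees of freedom.
# The grid starts above 0 because dchisq(0, df = 1) is Inf.
x_grid <- seq(0.1, 10, by = 0.1)
df_values <- c(1, 2, 3)

chisq_long <- do.call(rbind, lapply(df_values, function(k) {
  data.frame(x = x_grid, density = dchisq(x_grid, df = k), df = factor(k))
}))

# Build the plot only if ggplot2 is installed. Colour is mapped to df,
# so a legend takes the place of the manually placed text labels.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(chisq_long, aes(x = x, y = density, colour = df)) +
    geom_line() +
    ggtitle("Chisq-Distribution")
}
```

Typing p at the console then draws the plot.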

(7) Exponential Distribution : stat_function(fun = dexp)

> # Exponential distribution : fun = dexp
> ggplot(data.frame(x=c(0,10)), aes(x=x)) +
+   stat_function(fun=dexp, colour="brown", size=1.5) +
+   ggtitle("Exponential Distribution")
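
Since no args are given above, dexp() uses its default rate=1. As a brief sketch (the cut-off value 2 is an arbitrary choice), the tail probability and the mean can be checked against the closed forms exp(-rate * x) and 1 / rate.

```r
# Exponential distribution with the default rate of 1.
rate <- 1

# Tail probability P(X > 2), compared with the closed form exp(-rate * 2).
p_tail <- pexp(2, rate = rate, lower.tail = FALSE)
closed_form <- exp(-rate * 2)

# The mean is 1 / rate; here it is checked by numerical integration.
mean_numeric <- integrate(function(x) x * dexp(x, rate = rate),
                          lower = 0, upper = Inf)$value

c(p_tail = p_tail, closed_form = closed_form, mean = mean_numeric)
```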

(8) F Distribution : stat_function(fun = df)

> # F distribution : fun = df
> ggplot(data.frame(x=c(0,5)), aes(x=x)) +
+   stat_function(fun=df, args=list(df1=5, df2=10), colour="purple", size=1) +
+   ggtitle("F Distribution of (df1=5, df2=10)")
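
The same (df1=5, df2=10) parameters can be passed to the quantile and cumulative functions. The sketch below finds an upper 5% critical value; the 0.95 level is an assumption for illustration and does not appear in the original post.

```r
# Upper 5% critical value of the F distribution with df1 = 5, df2 = 10.
crit <- qf(0.95, df1 = 5, df2 = 10)

# Check: the cumulative probability at the critical value is 0.95,
# and the upper-tail probability is 0.05.
p_lower <- pf(crit, df1 = 5, df2 = 10)
p_upper <- pf(crit, df1 = 5, df2 = 10, lower.tail = FALSE)

c(critical_value = crit, lower = p_lower, upper = p_upper)
```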

(9) Gamma Distribution : stat_function(fun = dgamma)

> # Gamma distribution : fun = dgamma
> ggplot(data.frame(x=c(0, 400)), aes(x=x)) +
+   stat_function(fun=dgamma, args=list(shape=5, rate=0.05), colour="green") +
+   ggtitle("Gamma Distribution of (shape=5, rate=0.05)")
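
The x range of 0 to 400 fits these parameters: for a gamma distribution the mean is shape / rate and, when shape > 1, the mode is (shape - 1) / rate. The following sketch, added here as a check and not taken from the original post, computes both and confirms that nearly all of the probability mass lies below 400.

```r
# Summary values for Gamma(shape = 5, rate = 0.05).
shape <- 5
rate <- 0.05

mean_theory <- shape / rate          # 100
sd_theory   <- sqrt(shape) / rate    # about 44.7
mode_theory <- (shape - 1) / rate    # 80, the location of the peak

# Numerical check of the peak location on the plotted range.
mode_numeric <- optimize(dgamma, interval = c(0, 400), maximum = TRUE,
                         shape = shape, rate = rate)$maximum

# Share of the probability mass that falls inside [0, 400].
mass_in_range <- pgamma(400, shape = shape, rate = rate)

c(mean = mean_theory, sd = sd_theory, mode = mode_numeric, mass = mass_in_range)
```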

(10) Uniform Distribution : stat_function(fun = dunif)

> # Uniform distribution : fun = dunif
> ggplot(data.frame(x=c(-2,20)), aes(x=x)) +
+   stat_function(fun=dunif, args=list(min = 0, max = 10), colour="black", size=1) +
+   ggtitle("Uniform Distribution of (min=0, max=10)")
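
The flat segment of the curve has height 1 / (max - min), and the density is 0 outside [min, max]. A minimal sketch of those values follows; the probe points 5, -1, 12, 3, and 7 are arbitrary choices for illustration.

```r
# Uniform distribution on [0, 10].
a <- 0
b <- 10

height  <- dunif(5, min = a, max = b)            # 1 / (b - a) = 0.1 inside the interval
outside <- dunif(c(-1, 12), min = a, max = b)    # 0 outside the interval

# Probability of falling between 3 and 7 is (7 - 3) / (b - a) = 0.4.
p_3_to_7 <- punif(7, min = a, max = b) - punif(3, min = a, max = b)

c(height = height, p_3_to_7 = p_3_to_7)
```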

As a bonus, let's also draw the common logarithm function and the sine and cosine function curves. These are ordinary mathematical functions, not probability distributions.

(11) Common Logarithm Function : stat_function(fun = log10)

> # Common logarithm function : fun = log10
> ggplot(data.frame(x=c(0,100)), aes(x=x)) +
+   stat_function(fun=log10, colour="black", size=1.5) +
+   geom_vline(xintercept=10, colour="grey", linetype="dashed", size=1) +
+   geom_vline(xintercept=100, colour="grey", linetype="dashed", size=1) +
+   ggtitle("Common Logarithm Function")

(12) Sine Function Curve, Cosine Function Curve :

stat_function(fun = sin), stat_function(fun = cos)

> # Sine function : fun = sin, cosine function : fun = cos
> ggplot(data.frame(x=c(0,2*pi)), aes(x=x)) +
+   stat_function(fun=sin, colour="blue", size=1) +
+   geom_text(x=0.2, y=0, label="sine curve") +
+   
+   stat_function(fun=cos, colour="yellow", size=1) + 
+   geom_text(x=0.2, y=1, label="cosine curve") +
+   
+   geom_vline(xintercept=pi, colour="grey", linetype="dashed", size=1) + # add a vertical line at pi
+   geom_vline(xintercept=2*pi, colour="grey", linetype="dashed", size=1) + # add a vertical line at 2*pi
+   ggtitle("Sine(blue curve), Cosine(yellow curve) Function")

I hope this was helpful.


Posted by Rfriend
