A Friend of R and Python Analysis and Programming (by R Friend)

[Python] Scatter plot with 4 variables, different marker size & color (3/4)

Python Analysis and Programming/Python Graphs_Visualization 2019. 1. 13. 23:47

This is the third post in a series on the scatter plot, a useful tool for examining the relationship between two continuous variables. It shows how to use four continuous variables at once by mapping them to the X axis, the Y axis, the marker color, and the marker size.

In other words, it is a way to visualize four-dimensional data on a two-dimensional plot.

(1) Scatter Plot

(2) Scatter Plot by Groups

(3) Scatter plot with 4 variables, different marker size & color

(4) Scatter Plot Matrix

The example data are the four continuous variables of the iris dataset.

# importing libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

plt.rcParams['figure.figsize'] = [10, 8] # setting figure size

# loading 'iris' dataset from seaborn

iris = sns.load_dataset('iris')

iris.shape

(150, 5)

iris.head()

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

(1) Scatter plot with 4 continuous variables using matplotlib (X axis, Y axis, color, size)

Use the plt.scatter() function, assigning one variable to s for the marker size and another to c for the marker color.

# 4 dimensional scatter plot with different size & color

plt.scatter(iris.sepal_length, # x
            iris.sepal_width, # y
            alpha=0.2,
            s=200*iris.petal_width, # marker size
            c=iris.petal_length, # marker color
            cmap='viridis')

plt.title('Scatter Plot with Size(Petal Width) & Color(Petal Length)', fontsize=14)

plt.xlabel('Sepal Length', fontsize=12)

plt.ylabel('Sepal Width', fontsize=12)

plt.colorbar()

plt.show()
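
matplotlib interprets s as the marker area in points squared, so multiplying a variable by a constant (200 above) is only one possible choice. Below is a minimal sketch of an alternative that rescales any variable into a chosen area range. The helper name scale_sizes, the range 20 to 400, and the sample values are assumptions made for illustration and are not part of the original post.

```python
import numpy as np

def scale_sizes(values, min_size=20.0, max_size=400.0):
    """Linearly rescale values into [min_size, max_size] (marker areas in points^2)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # all values identical: fall back to the midpoint of the size range
        return np.full(values.shape, (min_size + max_size) / 2.0)
    return min_size + (values - values.min()) / span * (max_size - min_size)

# hypothetical petal-width-like values
widths = np.array([0.1, 0.2, 1.3, 2.5])
sizes = scale_sizes(widths)
print(sizes)  # smallest value maps to 20, largest to 400
```

The result could then be passed to plt.scatter() as s=scale_sizes(iris.petal_width).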

To change the marker shape to a square, set marker='s'.

# 4 dimensional scatter plot with different size & color

plt.scatter(iris.sepal_length, # x
            iris.sepal_width, # y
            alpha=0.2,
            s=200*iris.petal_width, # marker size
            c=iris.petal_length, # marker color
            cmap='viridis',
            marker = 's') # square shape

plt.title('Size(Petal Width) & Color(Petal Length) with Square Marker', fontsize=14)

plt.xlabel('Sepal Length', fontsize=12)

plt.ylabel('Sepal Width', fontsize=12)

plt.colorbar()

plt.show()

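(2) Scatter plot with 4 continuous variables using seaborn (X axis, Y axis, color, size)

The seaborn scatter plot code is clean and easy to read, and it is very convenient because the legend for color and size is generated automatically.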

# 4 dimensional scatter plot by seaborn

sns.scatterplot(x='sepal_length',
                y='sepal_width',
                hue='petal_length',
                size='petal_width',
                data=iris)

plt.show()

(3) Scatter plot with 4 continuous variables using pandas (X axis, Y axis, color, size)

Call plot(kind='scatter') on a pandas DataFrame, setting color=iris['petal_length'] for the color and s=iris['petal_width']*100 for the size. The pandas scatter plot code is also clean and easy to read, but adding a legend is not easy. ^^;

iris.plot(kind='scatter',
          x='sepal_length',
          y='sepal_width',
          color=iris['petal_length'],
          s=iris['petal_width']*100)

# the default round marker is used here, so the title does not mention a square marker
plt.title('Size(Petal Width) & Color(Petal Length) by pandas', fontsize=14)

plt.show()
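
One possible way to get a size legend is matplotlib's PathCollection.legend_elements(prop="sizes"), available in matplotlib 3.1 and later. The sketch below uses synthetic stand-in data rather than iris so that it is self-contained; the variable names, the factor 100, and the Agg backend are assumptions for illustration. A pandas scatter plot returns a matplotlib Axes, so the same method may be reachable through that Axes' collections, though that route is not shown here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.normal(size=50)
color_var = rng.uniform(1, 7, size=50)     # stand-in for petal_length
size_var = rng.uniform(0.1, 2.5, size=50)  # stand-in for petal_width

fig, ax = plt.subplots()
sc = ax.scatter(x, y, c=color_var, s=100 * size_var, cmap='viridis', alpha=0.5)
fig.colorbar(sc, ax=ax, label='color variable')

# legend_elements(prop="sizes") builds handles for representative marker sizes;
# func maps the plotted areas back to the original variable (undoing the *100)
handles, labels = sc.legend_elements(prop="sizes", num=4,
                                     func=lambda s: s / 100, alpha=0.5)
ax.legend(handles, labels, title='size variable', loc='upper right')
plt.close(fig)
```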

For reference, the symbols that set the marker shape in a scatter plot are shown below.

# set the size

plt.rcParams['figure.figsize'] = [10, 8]

# remove ticks and values of axis

plt.xticks([])

plt.yticks([])

# markers' shape

# 13 marker symbols; the last three entries are empty strings (no marker) that pad the 4x4 grid
all_shape=['.','o','v','^','>','<','s','p','*','h','H','D', 'd', '', '', '']

num = 0

for x in range(1, 5):
    for y in range(1, 5):
        num += 1

        # draw one marker per grid cell
        plt.plot(x, y,
                 marker = all_shape[num-1],
                 markerfacecolor='green',
                 markersize=20,
                 markeredgecolor='black')

        # label the marker with its symbol
        plt.text(x+0.1, y,
                 all_shape[num-1],
                 horizontalalignment='left',
                 size='medium',
                 color='black',
                 weight='semibold')

plt.title('Markers', fontsize=20)

plt.show()

I hope this was helpful.

If this post was helpful, please click the 'Like' button below. ^^

The next post will introduce the scatter plot matrix (Scatter Plot Matrix).

License: Attribution, NonCommercial, NoDerivatives

Posted by Rfriend

[Python NumPy] Random sampling, random number generation

Python Analysis and Programming/Python Data Preprocessing 2017. 1. 21. 22:39

This post covers random number generation (random sampling), which is useful when a full census is impractical because of time and cost and a sample survey is needed, when splitting a dataset into training/validation/test sets for machine learning, or when generating random data from various probability distributions for simulation.

Python NumPy provides the numpy.random module, which generates random samples very quickly and efficiently.

Let's import NumPy and draw a random sample of 5 values (size=5) from the normal distribution (np.random.normal). Note that the values differ each time a sample is drawn.

In [1]: import numpy as np

In [2]: np.random.normal(size=5)

Out[2]: array([-0.02030555, 0.38279633, -1.02369692, 1.48083476, -0.44058273])

In [3]: np.random.normal(size=5) # array with different random numbers

Out[3]: array([ 1.11942454, -1.03486318, 1.69015608, -0.43601241, -1.52195043])

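First, let's look at how to set the seed and the size parameter.

seed : setting the initial value for random number generation

If you want the same random numbers to be produced no matter who runs the code or when (that is, reproducibility), rather than different values on every run, set a seed number.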

# seed : setting the seed number for random number generation for reproducibility

In [4]: np.random.seed(seed=100)

In [5]: np.random.normal(size=5)

Out[5]: array([-1.74976547, 0.3426804 , 1.1530358 , -0.25243604, 0.98132079])

# exactly the same with the above random numbers

In [6]: np.random.seed(seed=100)

In [7]: np.random.normal(size=5) # same as the result above

Out[7]: array([-1.74976547, 0.3426804 , 1.1530358 , -0.25243604, 0.98132079])
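
NumPy 1.17 and later also offer a Generator-based interface, np.random.default_rng, in which each generator object carries its own seeded state instead of relying on the global seed. This is a swapped-in alternative to the np.random.seed approach used in this post, shown as a minimal sketch; it draws from a different random stream, so its values will not match the outputs above.

```python
import numpy as np

# two independent generators created with the same seed
rng_a = np.random.default_rng(seed=100)
rng_b = np.random.default_rng(seed=100)

sample_a = rng_a.normal(size=5)
sample_b = rng_b.normal(size=5)

# same seed, same sequence
print(np.array_equal(sample_a, sample_b))  # True
```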

size : setting the number of samples to draw and the array shape

Another advantage of the NumPy random module is that it can generate random samples directly as multi-dimensional arrays.

# size : int or tuple of ints for setting the shape of random number array

In [8]: np.random.normal(size=2)

Out[8]: array([ 0.51421884, 0.22117967])

In [9]: np.random.normal(size=(2, 3))

Out[9]:

array([[-1.07004333, -0.18949583, 0.25500144],

[-0.45802699, 0.43516349, -0.58359505]])

In [10]: np.random.normal(size=(2, 3, 4))

Out[10]:

array([[[ 0.81684707, 0.67272081, -0.10441114, -0.53128038],

[ 1.02973269, -0.43813562, -1.11831825, 1.61898166],

[ 1.54160517, -0.25187914, -0.84243574, 0.18451869]],

[[ 0.9370822 , 0.73100034, 1.36155613, -0.32623806],

[ 0.05567601, 0.22239961, -1.443217 , -0.75635231],

[ 0.81645401, 0.75044476, -0.45594693, 1.18962227]]])
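
As a quick check of how size maps to the array shape, the following sketch inspects the shape attribute of each result. It uses the same np.random.normal call as above; only the variable names are introduced here for illustration.

```python
import numpy as np

np.random.seed(100)

a = np.random.normal(size=2)          # 1-D array with 2 values
b = np.random.normal(size=(2, 3))     # 2 rows x 3 columns
c = np.random.normal(size=(2, 3, 4))  # 2 blocks of 3 x 4

print(a.shape, b.shape, c.shape)  # (2,) (2, 3) (2, 3, 4)
print(c.size)                     # 24 values in total
```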

Now let's generate random numbers from various probability distributions. We start with discrete probability distributions, which yield integers: (1-1) the binomial distribution, (1-2) the hypergeometric distribution, and (1-3) the Poisson distribution.

Explaining each distribution here would make the post too long, so links to the relevant posts are given instead.

- Binomial Distribution : http://rfriend.tistory.com/99

- Hypergeometric distribution : http://rfriend.tistory.com/100

- Poisson Distribution : http://rfriend.tistory.com/101

(1-1) Random sampling from the Binomial Distribution : np.random.binomial(n, p, size)

Here a coin with a 50% chance (p=0.5) of heads or tails is tossed once per trial (n=1), and the trial is repeated 20 times (size=20).

# (1) Discrete Probability Distribution
# (1-1) Binomial Distribution : np.random.binomial(n, p, size)
# : sampling with replacement
# : n an integer >= 0 and p is in the interval [0,1]

In [11]: np.random.binomial(n=1, p=0.5, size=20)

Out[11]: array([0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1])

In [12]: sum(np.random.binomial(n=1, p=0.5, size=100) == 1)/100

Out[12]: 0.46999999999999997
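
The proportion of heads in In [12] is only close to 0.5 because 100 tosses is a small sample. The sketch below, an illustration rather than part of the original session, repeats the experiment with larger sizes and also shows what n greater than 1 means: each draw then counts the successes in n trials.

```python
import numpy as np

np.random.seed(100)

# n=1: each draw is a single coin toss (0 or 1); the mean estimates p
for size in (100, 10_000, 1_000_000):
    tosses = np.random.binomial(n=1, p=0.5, size=size)
    print(size, tosses.mean())

# n=10: each draw counts the heads in 10 tosses, so values lie in 0..10
heads_in_10 = np.random.binomial(n=10, p=0.5, size=100_000)
print(heads_in_10.min(), heads_in_10.max(), heads_in_10.mean())
```

As size grows, the printed proportion should settle closer to p=0.5, and the mean of heads_in_10 should be close to n*p = 5.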

(1-2) Random sampling from the Hypergeometric distribution : np.random.hypergeometric(ngood, nbad, nsample, size)

From a population of 5 good and 20 bad items, we draw 5 samples at random without replacement, repeat this simulation 100 times, tabulate the frequencies, and show them as a bar chart. Each simulated value is the number of good items among the 5 drawn.

# (1-2) Hypergeometric distribution
# : sampling without replacement
# : np.random.hypergeometric(ngood, nbad, nsample, size=None)

In [13]: np.random.seed(seed=100)

In [14]: rand_hyp = np.random.hypergeometric(ngood=5, nbad=20, nsample=5, size=100)

In [15]: rand_hyp

Out[15]:

array([1, 1, 1, 1, 1, 1, 0, 3, 0, 1, 2, 0, 1, 1, 2, 0, 1, 1, 0, 0, 0, 3, 2,
        0, 0, 0, 1, 3, 1, 1, 0, 0, 1, 0, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2,
        1, 0, 1, 1, 2, 1, 2, 1, 2, 0, 1, 2, 1, 1, 0, 1, 0, 1, 0, 1, 1, 2, 0,
        2, 0, 1, 2, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 2, 1, 1, 0, 0, 2, 1, 1,
        1, 1, 1, 1, 1, 3, 1, 0])

# result table of 100 simulation

In [16]: unique, counts = np.unique(rand_hyp, return_counts=True)

In [17]: np.asarray((unique, counts)).T

Out[17]:

array([[ 0, 27],
       [ 1, 53],
       [ 2, 16],
       [ 3,  4]], dtype=int64)

# bar plot

In [18]: import matplotlib.pyplot as plt

In [19]: plt.bar(unique, counts, width=0.5, color="blue", align='center')

Out[19]: <Container object of 4 artists>
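
To see how close the simulated frequencies are to theory, the empirical proportions can be compared with the hypergeometric probability mass function, P(X = k) = C(ngood, k) * C(nbad, nsample - k) / C(ngood + nbad, nsample). The following is a minimal sketch under the same parameters; the larger sample size of 100,000 and the variable names are assumptions for illustration.

```python
import numpy as np
from math import comb

ngood, nbad, nsample = 5, 20, 5

# theoretical pmf for k = 0..nsample good items in the sample
total = comb(ngood + nbad, nsample)
pmf = np.array([comb(ngood, k) * comb(nbad, nsample - k) / total
                for k in range(nsample + 1)])

np.random.seed(100)
draws = np.random.hypergeometric(ngood=ngood, nbad=nbad,
                                 nsample=nsample, size=100_000)
empirical = np.bincount(draws, minlength=nsample + 1) / draws.size

# columns: k, theoretical probability, simulated proportion
for k in range(nsample + 1):
    print(k, round(pmf[k], 4), round(empirical[k], 4))
```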

(1-3) Random sampling from the Poisson distribution : np.random.poisson(lam, size)

We generate 100 random numbers from a Poisson distribution in which λ (lambda), the average number of events occurring at random in a fixed unit of time or space, is 20. We then count the frequencies and draw a bar chart of the distribution.

# (1-3) Poisson Distribution
# np.random.poisson(lam=1.0, size=None)
# Poisson distribution is the limit of the binomial distribution for large N

In [20]: np.random.seed(seed=100)

In [21]: rand_pois = np.random.poisson(lam=20, size=100)

In [22]: rand_pois

Out[22]:

array([21, 19, 22, 14, 26, 15, 25, 25, 19, 25, 15, 24, 21, 13, 26, 23, 21,
       16, 24, 17, 18, 18, 15, 18, 22, 28, 21, 18, 17, 31, 23, 13, 20, 19,
       24, 17, 20, 13, 19, 16, 16, 21, 16, 21, 19, 20, 20, 19, 19, 20, 13,
       29,  9, 13, 20, 29, 15, 15, 21, 20, 21, 18, 16, 20, 23, 18, 22, 14,
       19, 20, 18, 17, 20, 24, 20, 15, 19, 19, 25, 17, 19, 27, 20, 17, 12,
       22, 16, 23, 17, 11, 15, 19, 16, 21, 21, 25, 26, 23, 15, 25])

In [23]: unique, counts = np.unique(rand_pois, return_counts=True)

In [24]: np.asarray((unique, counts)).T

Out[24]:

array([[ 9,  1],
       [11,  1],
       [12,  1],
       [13,  5],
       [14,  2],
       [15,  8],
       [16,  7],
       [17,  7],
       [18,  7],
       [19, 12],
       [20, 12],
       [21, 10],
       [22,  4],
       [23,  5],
       [24,  4],
       [25,  6],
       [26,  3],
       [27,  1],
       [28,  1],
       [29,  2],
       [31,  1]], dtype=int64)

In [25]: plt.bar(unique, counts, width=0.5, color="red", align='center')

Out[25]: <Container object of 21 artists>
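
A defining property of the Poisson distribution is that its mean and variance both equal λ. The sketch below checks this on a larger simulated sample; the sample size of 100,000 is an assumption chosen so the estimates are stable.

```python
import numpy as np

np.random.seed(100)
lam = 20
draws = np.random.poisson(lam=lam, size=100_000)

# both values should be close to lam = 20
print(draws.mean(), draws.var())
```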

Next, we look at how to generate random numbers from continuous probability distributions: (2-1) the normal distribution, (2-2) the t-distribution, (2-3) the uniform distribution, (2-4) the F-distribution, and (2-5) the chi-square distribution.

For the theory behind each distribution, see the posts linked below.

- Normal Distribution : http://rfriend.tistory.com/102

- Student's t-distribution : http://rfriend.tistory.com/110

- Uniform Distribution : http://rfriend.tistory.com/106

- F-distribution : http://rfriend.tistory.com/111

- Chisq-distribution : http://rfriend.tistory.com/112

(2-1) Random sampling from the Normal Distribution : np.random.normal(loc, scale, size)

We generate 100 random numbers from a normal distribution with mean 0 and standard deviation 3, then draw histograms to examine the distribution both as frequencies per bin and as a normalized density.

# (2) continuous probability distribution
# (2-1) generating random numbers from the normal distribution
# Draw random samples from a normal (Gaussian) distribution
# np.random.normal(loc=0.0, scale=1.0, size=None)
# loc : Mean (“centre”) of the distribution (mu in the code)
# scale : Standard deviation (spread or “width”) of the distribution (sigma in the code)
# size : Output shape

In [26]: np.random.seed(100)

In [27]: mu, sigma = 0.0, 3.0

In [28]: rand_norm = np.random.normal(mu, sigma, size=100)

In [29]: rand_norm

Out[29]:

array([-5.24929642, 1.02804121, 3.45910741, -0.75730811, 2.94396236,
        1.54265652, 0.66353901, -3.21012999, -0.56848749, 0.76500433,
       -1.37408096, 1.30549046, -1.75078515, 2.45054122, 2.01816242,
       -0.31323343, -1.59384113, 3.08919806, -1.31440687, -3.35495474,
        4.85694498, 4.62481552, -0.75563742, -2.52730721, 0.55355607,
        2.8112466 , 2.19300103, 4.08466838, -0.97871418, 0.16702804,
        0.66719883, -4.32965099, -2.26905692, 2.44936203, 2.25133428,
       -1.36784078, 3.5688668 , -5.07185048, -4.06919715, -3.69730354,
       -1.63331749, -2.00451521, 0.02194369, -1.83881621, 3.89924422,
       -5.19928687, -2.9499303 , 1.07252326, -4.84073551, 4.4121416 ,
       -3.56405279, -1.64923858, -2.82013848, -2.48379709, 0.3265904 ,
        1.52342877, -2.58668204, 3.74840923, -0.23883374, -2.66919444,
       -2.64539517, 0.05591685, 0.71353387, 0.04064565, -4.9065882 ,
       -3.13262963, 1.83911665, 2.20861564, 3.08076432, -4.29657183,
       -5.5235649 , 1.09827968, -0.99533141, -2.06765393, 6.10382268,
       -1.65214324, 2.25135999, -3.92097702, 1.74172001, -3.31356928,
        2.07036441, 2.0606702 , -4.70006259, 2.71492236, 2.3364672 ,
        1.28469861, 0.32661597, 0.0848509 , -1.73647747, -3.5983536 ,
       -5.11785602, 1.10749187, 5.62972028, -1.13071005, 5.49580825,
        0.0090523 , -0.2280704 , 0.01187278, -0.55504233, -7.46145461])

In [30]: import matplotlib.pyplot as plt

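# histogram with frequency
# (the normed argument was removed from plt.hist in newer matplotlib; density replaces it)

In [31]: count, bins, ignored = plt.hist(rand_norm, density=False)

# histogram normalized to a probability density (total area = 1)
In [32]: count, bins, ignored = plt.hist(rand_norm, density=True)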
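
The difference between the frequency histogram and the normalized one can be checked numerically with np.histogram, which bins data in the same way without drawing a figure. In this minimal sketch the normalized heights equal counts / (n * bin width), so the bar areas sum to 1; the variable names are chosen for illustration.

```python
import numpy as np

np.random.seed(100)
rand_norm = np.random.normal(0.0, 3.0, size=100)

counts, edges = np.histogram(rand_norm, bins=10)             # frequencies per bin
density, _ = np.histogram(rand_norm, bins=10, density=True)  # probability density

widths = np.diff(edges)
print(counts.sum())              # 100 observations in total
print((density * widths).sum())  # total area, approximately 1.0

# density is the count divided by (number of samples * bin width)
print(np.allclose(density, counts / (counts.sum() * widths)))
```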

(2-2) Random sampling from the t-distribution : np.random.standard_t(df, size)

We generate 100 random numbers from a t-distribution with 3 degrees of freedom and draw a histogram.

# (2-2) generating random numbers from Student's t-distribution
# Draw samples from a standard Student’s t distribution with df degrees of freedom
# np.random.standard_t(df, size=None)
# df : Degrees of freedom
# size : Output shape

In [33]: np.random.seed(100)

In [34]: rand_t = np.random.standard_t(df=3, size=100)

In [35]: rand_t

Out[35]:

array([-1.70633623, 0.61010003, 0.45753218, -0.85709656, -0.42990712,
       -0.7437467 , 0.8444005 , -0.4040428 , 2.13905276, -0.10844638,
        0.67238716, 1.88720362, -2.57340231, -0.69724955, -3.40107659,
       -0.57745433, -0.36487447, 3.95862541, 2.34665412, -0.94310449,
        0.81852816, -0.48391289, 0.01380029, -0.43003718, -2.25784604,
       -0.18216847, -1.21433582, 0.46347964, 0.50024665, -1.1595865 ,
        0.02358778, -1.18879826, -0.38767689, 2.24289791, -2.80798472,
       -2.838893 , -0.39222432, -1.61499121, -1.78498184, 0.44618923,
       -1.5181203 , 5.44389927, 4.17743903, -0.49617121, -0.02996529,
        0.89595015, 1.14860485, -3.16541308, 0.14279246, 0.83121743,
       -0.32403947, 0.59297222, -0.39750861, 0.57634934, 0.81587478,
       -1.29367024, -0.28580516, -0.48422765, -0.83697192, 0.50702557,
       -1.98915687, 2.92965716, -1.19522074, 0.65511251, 2.12055605,
       -0.03640814, -0.41931018, 3.31199804, -0.61725596, 0.79681204,
        1.86805014, -0.54345259, 3.11909936, 0.86410458, 2.66353682,
        0.23735454, -0.76306875, 0.24471792, -0.13515045, 0.26402784,
        4.68946895, 0.70573709, -0.17783758, 1.85205955, -0.18352788,
       -0.65713104, -0.73674278, 2.16549569, 1.22326388, -0.5112858 ,
       -1.54451989, -1.73428432, 0.46947115, 1.66594804, 0.51687137,
        1.51361314, -2.22193709, 0.89557421, 0.56222653, -0.55564416])

# histogram

In [36]: import matplotlib.pyplot as plt

In [37]: count, bins, ignored = plt.hist(rand_t, bins=20, density=True)
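
With few degrees of freedom the t-distribution has heavier tails than the standard normal, which is why values such as 5.44 appear in the sample above. A rough sketch of that comparison, using a threshold of 3 and a sample size of 200,000 as illustrative assumptions:

```python
import numpy as np

np.random.seed(100)
n = 200_000
t_draws = np.random.standard_t(df=3, size=n)
z_draws = np.random.normal(size=n)

# share of draws farther than 3 from zero
t_tail = np.mean(np.abs(t_draws) > 3)
z_tail = np.mean(np.abs(z_draws) > 3)
print(t_tail, z_tail)  # the t share should be clearly larger
```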

(2-3) Random sampling from the Uniform distribution : np.random.uniform(low, high, size)

We generate 100 random numbers from a uniform distribution on the interval with minimum 0 and maximum 10, and draw a histogram to check the distribution.

# (2-3) generating random numbers from the Uniform Distribution
# Draw samples from a uniform distribution
# np.random.uniform(low=0.0, high=1.0, size=None)
# low : Lower boundary of the output interval
# high : Upper boundary of the output interval
# [low, high) : includes low, excludes high

In [38]: np.random.seed(100)

In [39]: rand_unif = np.random.uniform(low=0.0, high=10.0, size=100)

In [40]: rand_unif

Out[40]:

array([ 5.43404942, 2.78369385, 4.24517591, 8.44776132, 0.04718856,
        1.21569121, 6.70749085, 8.25852755, 1.3670659 , 5.75093329,
        8.91321954, 2.09202122, 1.8532822 , 1.0837689 , 2.19697493,
        9.78623785, 8.11683149, 1.71941013, 8.16224749, 2.74073747,
        4.31704184, 9.4002982 , 8.17649379, 3.3611195 , 1.75410454,
        3.72832046, 0.05688507, 2.52426353, 7.95662508, 0.15254971,
        5.98843377, 6.03804539, 1.05147685, 3.81943445, 0.36476057,
        8.90411563, 9.80920857, 0.59941989, 8.90545945, 5.76901499,
        7.42479689, 6.30183936, 5.81842192, 0.20439132, 2.10026578,
        5.44684878, 7.69115171, 2.50695229, 2.8589569 , 8.52395088,
        9.75006494, 8.84853293, 3.59507844, 5.98858946, 3.54795612,
        3.40190215, 1.7808099 , 2.37694209, 0.44862282, 5.0543143 ,
        3.76252454, 5.92805401, 6.29941876, 1.42600314, 9.33841299,
        9.46379881, 6.02296658, 3.8776628 , 3.63188004, 2.04345277,
        2.76765061, 2.46535881, 1.73608002, 9.66609694, 9.570126 ,
        5.97973684, 7.31300753, 3.40385223, 0.92055603, 4.63498019,
        5.08698893, 0.88460173, 5.28035223, 9.92158037, 3.95035932,
        3.35596442, 8.05450537, 7.54348995, 3.13066442, 6.34036683,
        5.40404575, 2.96793751, 1.10787901, 3.12640298, 4.5697913 ,
        6.5894007 , 2.54257518, 6.41101259, 2.00123607, 6.57624806])

# histogram

In [41]: import matplotlib.pyplot as plt

In [42]: count, bins, ignored = plt.hist(rand_unif, bins=10, density=True)
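
For a uniform distribution on [low, high), the mean is (low + high) / 2 and the variance is (high - low)^2 / 12. The following sketch checks both, along with the half-open interval, on a larger sample; the sample size is an assumption for illustration.

```python
import numpy as np

np.random.seed(100)
low, high = 0.0, 10.0
draws = np.random.uniform(low=low, high=high, size=100_000)

# every value lies in [low, high)
print(draws.min() >= low, draws.max() < high)

# sample mean vs theoretical mean, sample variance vs theoretical variance
print(draws.mean(), (low + high) / 2)
print(draws.var(), (high - low) ** 2 / 12)
```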

Random integers from a discrete uniform distribution : np.random.randint(low, high, size)

To draw integer random numbers from a discrete uniform distribution, use np.random.randint. The parameters are the same as for np.random.uniform(), and the value given to high is excluded. To generate integers from 0 to 10 inclusive, add 1, as in high=10+1.

# np.random.randint : Discrete uniform distribution, yielding integers

In [43]: np.random.seed(100)

In [44]: rand_int = np.random.randint(low=0, high=10+1, size=100)

In [45]: rand_int

Out[45]:

array([ 8, 8, 3, 7, 7, 0, 10, 4, 2, 5, 2, 2, 2, 1, 0, 8, 4,
       10, 0, 9, 6, 2, 4, 1, 5, 3, 4, 4, 3, 7, 1, 1, 7, 7,
        0, 2, 9, 9, 3, 2, 5, 8, 1, 0, 7, 6, 2, 0, 8, 2, 5,
       10, 1, 8, 10, 1, 5, 4, 2, 8, 3, 5, 0, 9, 10, 3, 6, 3,
        4, 10, 7, 6, 3, 9, 0, 4, 4, 5, 7, 6, 6, 2, 10, 4, 2,
        7, 1, 10, 6, 10, 6, 0, 10, 7, 2, 3, 5, 4, 2, 4])

# histogram

In [46]: import matplotlib.pyplot as plt

In [47]: count, bins, ignored = plt.hist(rand_int, bins=11, density=True) # 11 bins, one per integer 0..10 (bins=10 would merge 9 and 10)
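
As an alternative to adding 1 to high, the Generator interface (NumPy 1.17+) provides integers() with an endpoint=True option that makes the upper bound inclusive. This is a swapped-in API rather than the np.random.randint call used in this post, shown as a minimal sketch.

```python
import numpy as np

rng = np.random.default_rng(100)

# endpoint=True includes high, so values range over 0..10
rand_int = rng.integers(low=0, high=10, size=1000, endpoint=True)

values, counts = np.unique(rand_int, return_counts=True)
print(values)  # distinct integers that were drawn
print(counts)  # how often each one appeared
```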

(2-4) Random sampling from the F-distribution : np.random.f(dfnum, dfden, size)

We generate 100 random numbers from an F-distribution with 5 numerator and 10 denominator degrees of freedom, and draw a histogram to check the distribution.

# (2-4) generating random numbers from the F-distribution
# Draw samples from an F-distribution (Fisher distribution)
# numpy.random.f(dfnum, dfden, size=None)
# dfnum : degrees of freedom in numerator
# dfden : degrees of freedom in denominator

In [48]: np.random.seed(100)

In [49]: rand_f = np.random.f(dfnum=5, dfden=10, size=100)

In [50]: rand_f

Out[50]:

array([ 0.17509245, 1.34830314, 0.7250835 , 0.55013536, 1.49183341,
        1.19802261, 1.24949706, 0.70015548, 0.71890936, 0.37020715,
        4.70371284, 0.86726338, 5.12146941, 0.12848202, 0.68237285,
        0.79663258, 1.36935299, 1.08005188, 0.99311831, 0.15607878,
        3.7778542 , 2.35609305, 0.16850985, 0.98599364, 1.12567067,
        3.21579679, 0.87982087, 0.38319493, 0.96834789, 1.00428004,
        1.65589171, 1.2581278 , 1.71881244, 0.11251552, 1.65949951,
        1.15809569, 1.33210756, 0.37989215, 0.252446 , 1.22409406,
        1.86571485, 0.42345727, 3.52740557, 1.32989807, 2.0095314 ,
        1.20016474, 3.5067706 , 0.67232354, 2.79268109, 0.38115844,
        1.3978449 , 0.7089553 , 2.12685211, 0.73462708, 2.03686026,
        0.50287078, 0.31183315, 1.66994305, 5.36906534, 1.55708073,
        2.66826698, 1.31701804, 0.66126086, 0.19123589, 0.58223398,
        0.41897952, 2.17842598, 0.98481411, 0.46953552, 0.99266818,
        0.4463218 , 0.43809118, 0.37791494, 2.46417893, 0.91230902,
        0.50247167, 1.0960922 , 0.61328846, 2.07107491, 0.65524443,
        4.00311763, 1.61430287, 0.16159395, 2.42851301, 1.38124899,
        0.33750889, 1.93776135, 1.55612023, 0.59284748, 0.56785228,
        1.09259657, 1.22611626, 0.0744978 , 0.10373193, 1.95616674,
        2.29130443, 0.62968361, 0.67477008, 0.60981642, 0.58408102])

# histogram

In [51]: import matplotlib.pyplot as plt

In [52]: count, bins, ignored = plt.hist(rand_f, bins=20, density=True)
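
An F(dfnum, dfden) variable can be viewed as the ratio of two independent chi-square variables, each divided by its degrees of freedom, and its mean is dfden / (dfden - 2) when dfden > 2. The sketch below builds that ratio by hand and compares it with np.random.f; the sample size and variable names are assumptions for illustration.

```python
import numpy as np

np.random.seed(100)
dfnum, dfden, n = 5, 10, 200_000

# ratio of two independent chi-square draws, each scaled by its degrees of freedom
chi_num = np.random.chisquare(df=dfnum, size=n)
chi_den = np.random.chisquare(df=dfden, size=n)
ratio = (chi_num / dfnum) / (chi_den / dfden)

# direct draws from the F-distribution
direct = np.random.f(dfnum=dfnum, dfden=dfden, size=n)

# both sample means should be near the theoretical mean 10 / 8 = 1.25
print(ratio.mean(), direct.mean(), dfden / (dfden - 2))
```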

(2-5) Random sampling from the Chisq-distribution : np.random.chisquare(df, size)

We generate 100 random numbers from a chi-square distribution with 2 degrees of freedom and draw a histogram to check the distribution.

# (2-5) generating random numbers from the Chisq-distribution
# Draw samples from a chi-square distribution
# np.random.chisquare(df, size=None)
# df : Number of degrees of freedom

In [53]: np.random.seed(100)

In [54]: rand_chisq = np.random.chisquare(df=2, size=100)

In [55]: rand_chisq

Out[55]:

array([ 1.56791674e+00,   6.52483769e-01,   1.10509323e+00,
         3.72577379e+00,   9.46005029e-03,   2.59236110e-01,
         2.22187032e+00,   3.49570821e+00,   2.94001314e-01,
         1.71177147e+00,   4.43873095e+00,   4.69425742e-01,
         4.09939940e-01,   2.29423517e-01,   4.96147209e-01,
         7.69095282e+00,   3.33925872e+00,   3.77341773e-01,
         3.38808346e+00,   6.40613699e-01,   1.13022638e+00,
         5.62781567e+00,   3.40364791e+00,   8.19283487e-01,
         3.85739072e-01,   9.33081811e-01,   1.14094971e-02,
         5.81844910e-01,   3.17596456e+00,   3.07450508e-02,
         1.82680669e+00,   1.85169520e+00,   2.22193172e-01,
         9.62350625e-01,   7.43158820e-02,   4.42204683e+00,
         7.91831906e+00,   1.23627383e-01,   4.42450081e+00,
         1.72030053e+00,   2.71331337e+00,   1.98949905e+00,
         1.74379278e+00,   4.13018033e-02,   4.71511953e-01,
         1.57353105e+00,   2.93167254e+00,   5.77218949e-01,
         6.73452471e-01,   3.82643217e+00,   7.37827846e+00,
         4.32309651e+00,   8.91036809e-01,   1.82688432e+00,
         8.76376262e-01,   8.31607381e-01,   3.92226832e-01,
         5.42815006e-01,   9.17994841e-02,   1.40813894e+00,
         9.44019133e-01,   1.79692816e+00,   1.98819039e+00,
         3.07702183e-01,   5.43139774e+00,   5.85166185e+00,
         1.84409785e+00,   9.81282349e-01,   9.02561614e-01,
         4.57179904e-01,   6.48042320e-01,   5.66147763e-01,
         3.81372088e-01,   6.79897935e+00,   6.29369648e+00,
         1.82247546e+00,   2.62832513e+00,   8.32198571e-01,
         1.93144279e-01,   1.24537005e+00,   1.42139617e+00,
         1.85239984e-01,   1.50170184e+00,   9.69653206e+00,
         1.00517243e+00,   8.17731091e-01,   3.27413768e+00,
         2.80768685e+00,   7.51035408e-01,   2.01044435e+00,
         1.55481738e+00,   7.04210092e-01,   2.34838981e-01,
         7.49795080e-01,   1.22121505e+00,   2.15139414e+00,
         5.86749873e-01,   2.04942998e+00,   4.46596145e-01,
         2.14369617e+00])

# histogram

In [56]: import matplotlib.pyplot as plt

In [57]: count, bins, ignored = plt.hist(rand_chisq, bins=20, density=True)
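
A chi-square variable with df degrees of freedom is the sum of df squared standard normal variables, with mean df and variance 2*df. A minimal sketch of that relationship, with the sample size and variable names chosen for illustration:

```python
import numpy as np

np.random.seed(100)
df, n = 2, 200_000

# sum of squares of df independent standard normal draws per row
z = np.random.normal(size=(n, df))
sum_sq = (z ** 2).sum(axis=1)

# direct draws from the chi-square distribution
direct = np.random.chisquare(df=df, size=n)

print(sum_sq.mean(), direct.mean())  # both near df = 2
print(sum_sq.var(), direct.var())    # both near 2 * df = 4
```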

I hope this was helpful.

For random sampling from a pandas DataFrame, see https://rfriend.tistory.com/602.
