Posts in the 'Python Analysis and Programming' category (page 24)

[Python NumPy] Indexing a multidimensional array with integer arrays : Fancy Indexing

Python Analysis and Programming/Python Data Preprocessing 2017. 3. 1. 20:21

The previous two posts covered

- indexing and slicing ( e.g. arr[0:2, 5:8] )

- Boolean indexing ( e.g. arr[bool_cond == 'A'] )

This post covers the third way to index an array: fancy indexing, which uses integer arrays as the indexer to select from a multidimensional array. Basic indexing and slicing return a view, whereas fancy indexing returns a copy.

(The three methods look alike, so it is easy to confuse them. After reading this post, it is worth going back and comparing it with the other two.)

With fancy indexing, an integer array goes in the position where the index or slice went in the earlier examples.

The examples are split into two parts:

(1) fancy indexing the rows of a multidimensional array in a given order

(2) fancy indexing the rows and columns of a multidimensional array in a given order

(1) Fancy indexing the rows of a multidimensional array in a given order

Let's select the whole rows at positions 1 and 2, counting from the top along axis 0 (the row axis). Use two pairs of square brackets, as in a[[1, 2]].

In [1]: import numpy as np

In [2]: a = np.arange(15).reshape(5, 3)

In [3]: a

Out[3]:

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

# (1) selecting a subset of the rows
# (1-1) selecting a subset of the rows by fancy indexing using integer arrays

In [4]: a[[1, 2]]

Out[4]:

array([[3, 4, 5],
       [6, 7, 8]])

To index from the bottom up along axis 0 (from the last row toward the first), use negative indices. Note that counting from the end starts at -1 rather than 0, so -1 is the last row. (Confusing, isn't it? -_-;)

# (1-2) selecting a subset of the rows from the end by using negative indices

In [5]: a[[-1, -2]]

Out[5]:

array([[12, 13, 14],
       [ 9, 10, 11]])
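
As a small sketch of the row-selection rule above (the variable names `reordered` and `from_end` are illustrative, not from the original post): an integer list may reorder or repeat rows, and negative entries count from the last row.

```python
import numpy as np

a = np.arange(15).reshape(5, 3)

# Rows come back in the order given by the integer list;
# the same row may appear more than once.
reordered = a[[2, 0, 2]]

# -1 is the last row, -2 the second to last.
from_end = a[[-1, -2]]

print(reordered)
print(from_end)
```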

(2) Fancy indexing the rows and columns of a multidimensional array in a given order

There are two ways.

(2-1) The first way selects the rows in the given order as in (1-1), then indexes the result once more, taking all rows with ':' and the wanted columns, in order, with an integer array.

(2-2) The second way uses the np.ix_ function, passing one array for the rows and a second array for the columns.

Both ways are shown below in turn, selecting rows 0, 2, 4 and columns 0, 2.

# (2) selecting a rectangular region of the rows and columns
# (2-1) selecting rows with an integer array first, and then selecting columns

In [6]: a[[0, 2, 4]][:, [0, 2]]

Out[6]:

array([[ 0,  2],
       [ 6,  8],
       [12, 14]])

# (2-2) by using np.ix_ function : converting two 1D integer arrays to an indexer

In [7]: a[np.ix_([0, 2, 4], [0, 2])]

Out[7]:

array([[ 0,  2],
       [ 6,  8],
       [12, 14]])
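
The sketch below is an illustration (not from the original post) of why np.ix_ is needed. It builds index arrays that broadcast to every (row, column) combination, whereas passing two equal-length integer lists directly pairs them element by element.

```python
import numpy as np

a = np.arange(15).reshape(5, 3)
rows, cols = [0, 2, 4], [0, 2]

# np.ix_ returns index arrays of shape (3, 1) and (1, 2);
# together they broadcast to all row/column combinations.
ix = np.ix_(rows, cols)
block = a[ix]

# Two equal-length lists are paired up instead:
# this picks a[0, 0], a[2, 2], a[4, 0].
pairs = a[[0, 2, 4], [0, 2, 0]]

print(ix[0].shape, ix[1].shape)
print(block)
print(pairs)
```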

(3) Fancy indexing creates a copy, not a view

To stress it once more: fancy indexing creates a copy, not a view. Changing the copied array therefore has no effect on the original array. (Basic indexing and slicing, covered in the earlier post, did the opposite and created a view, so a change made through the view was applied to the original array as well.)

# (3) fancy indexing creates a copy, not a view

In [8]: a

Out[8]:

array([[ 0, 1, 2],
       [ 3, 4, 5],
       [ 6, 7, 8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [9]: a_copy = a[np.ix_([0, 2, 4], [0, 2])]

In [10]: a_copy

Out[10]:

array([[ 0, 2],
[ 6, 8],
[12, 14]])

# change to the copy of the ndarray

In [11]: a_copy[0, :] = 100

In [12]: a_copy

Out[12]:

array([[100, 100],
[ 6, 8],
[ 12, 14]])

# no change to the original array

In [13]: a

Out[13]:

array([[ 0, 1, 2],
       [ 3, 4, 5],
       [ 6, 7, 8],
       [ 9, 10, 11],
       [12, 13, 14]])
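
One way to check the copy/view difference directly, as a sketch rather than part of the original post, is np.shares_memory, which reports whether two arrays overlap in memory. The names `fancy` and `sliced` are illustrative.

```python
import numpy as np

a = np.arange(15).reshape(5, 3)

fancy = a[[0, 2]]   # integer-array (fancy) indexing
sliced = a[0:2]     # basic slicing

print(np.shares_memory(a, fancy))    # the fancy result has its own data
print(np.shares_memory(a, sliced))   # the slice shares the data of `a`

sliced[0, 0] = -1    # this write is visible through `a`
fancy[0, 0] = 999    # this write is not
print(a[0, 0])
```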

I hope this was helpful.

License: Attribution, NonCommercial, NoDerivatives

Posted by Rfriend

[Python NumPy] Indexing an array with Boolean conditions (Boolean Indexing)

Python Analysis and Programming/Python Data Preprocessing 2017. 2. 27. 00:09

The previous post introduced indexing and slicing of Python NumPy arrays.

This post builds on indexing and slicing with 'Boolean indexing', which indexes with the True/False values produced by a condition.

During data preprocessing you often need to index and slice with condition operators such as == (equal), != (not equal), & (and), and | (or), so this post should come in handy.

Here is a simple example with a 2D array.

(1) Selecting all rows of an array that satisfy a condition : ==

# making a 2D array

In [1]: import numpy as np

In [2]: arr = np.arange(20).reshape(5, 4)

In [3]: arr

Out[3]:

array([[ 0, 1, 2, 3],
        [ 4, 5, 6, 7],
        [ 8, 9, 10, 11],
        [12, 13, 14, 15],
      [16, 17, 18, 19]])

Let's create an array to use for the Boolean condition.

For Boolean indexing, the number of elements in the Boolean array must equal the length of the indexed axis (axis 0 here). In this example arr has shape (5, 4), so the array 'axis_ABC' is built with 5 elements to match the 5 rows.

# an array of 'axis_ABC' with duplicates

In [4]: axis_ABC = np.array(['A', 'A', 'B', 'C', 'C'])

In [5]: axis_ABC

Out[5]:

array(['A', 'A', 'B', 'C', 'C'],
      dtype='<U1')

Now let's index every row of 'arr' where axis_ABC == 'A'. To state explicitly that all columns are wanted you can add a comma ',' and a colon ':', or you can omit them.

# selecting all the rows which axis_ABC equals(==) 'A'

In [6]: axis_ABC == 'A'

Out[6]: array([ True, True, False, False, False], dtype=bool)

In [7]: arr[axis_ABC == 'A']

Out[7]:

array([[0, 1, 2, 3],
[4, 5, 6, 7]])

# the same result with the above

In [8]: arr[axis_ABC == 'A', :]

Out[8]:

array([[0, 1, 2, 3],
[4, 5, 6, 7]])

As a short aside, after selecting rows with a Boolean condition you can add a comma ',' followed by a slice with a colon ':' or an integer index to pick columns.

# slicing with colon ':'

In [9]: arr[axis_ABC == 'A', :2]

Out[9]:

array([[0, 1],
[4, 5]])

# indexing with an integer => the result is a lower-dimensional array

In [10]: arr[axis_ABC == 'A', 2]

Out[10]: array([2, 6])

(2) Selecting all rows of an array that do not satisfy a condition : !=, ~(==)

The opposite of the example above: index all rows that are not 'A'.

There are two ways, '!=' and '~(==)'.

In [3]: arr

Out[3]:

array([[ 0, 1, 2, 3],
        [ 4, 5, 6, 7],
        [ 8, 9, 10, 11],
        [12, 13, 14, 15],
      [16, 17, 18, 19]])

In [4]: axis_ABC = np.array(['A', 'A', 'B', 'C', 'C'])

# selecting all the rows except 'A' : != 'A'

In [11]: arr[axis_ABC != 'A']

Out[11]:

array([[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19]])

# selecting all the rows except 'A' : ~(axis_ABC == 'A')

In [12]: arr[~(axis_ABC == 'A')]

Out[12]:

array([[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19]])
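
A quick check, sketched here with illustrative names (`mask`, `not_a_1`, `not_a_2`), that the two forms select the same rows. `~` inverts a Boolean array element by element.

```python
import numpy as np

arr = np.arange(20).reshape(5, 4)
axis_ABC = np.array(['A', 'A', 'B', 'C', 'C'])

mask = axis_ABC == 'A'           # array of True/False, one per row
not_a_1 = arr[axis_ABC != 'A']   # "not equal" form
not_a_2 = arr[~mask]             # "inverted mask" form

same = np.array_equal(not_a_1, not_a_2)
n_selected = int((~mask).sum())  # number of rows kept
print(same, n_selected)
```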

(3) Selecting rows of an array with multiple conditions : & (and), | (or)

Sometimes two or more Boolean conditions are needed. The examples below combine conditions with the & (and) and | (or) operators.

In [3]: arr

Out[3]:

array([[ 0, 1, 2, 3],
        [ 4, 5, 6, 7],
        [ 8, 9, 10, 11],
        [12, 13, 14, 15],
      [16, 17, 18, 19]])

In [4]: axis_ABC = np.array(['A', 'A', 'B', 'C', 'C'])

# indexing by using multiple boolean conditions, & (and), | (or)

In [13]: arr[(axis_ABC == 'A') | (axis_ABC == 'B')]

Out[13]:

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

In [14]: arr[(axis_ABC != 'A') & (axis_ABC != 'B')]

Out[14]:

array([[12, 13, 14, 15],
[16, 17, 18, 19]])

One caveat: typing the Python keywords 'and' or 'or' instead of '&' and '|' raises a ValueError.

# ValueError : and, or syntax

In [15]: arr[(axis_ABC != 'A') and (axis_ABC != 'B')]

Traceback (most recent call last):

File "<ipython-input-15-ca8b3629eb98>", line 1, in <module>

arr[(axis_ABC != 'A') and (axis_ABC != 'B')]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
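
The reason, sketched below with illustrative names: `&` works element by element, while the Python keyword `and` asks each operand for a single truth value, which is ambiguous for an array with more than one element. np.logical_and is the function form of `&` for Boolean arrays.

```python
import numpy as np

axis_ABC = np.array(['A', 'A', 'B', 'C', 'C'])

# Element-wise: one Boolean per element.
elementwise = (axis_ABC != 'A') & (axis_ABC != 'B')
same_thing = np.logical_and(axis_ABC != 'A', axis_ABC != 'B')

# `and` needs one truth value for the whole array, so it fails.
try:
    (axis_ABC != 'A') and (axis_ABC != 'B')
    raised = False
except ValueError:
    raised = True

print(elementwise, raised)
```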

(4) Assigning a scalar value to the elements selected by a Boolean condition

In [3]: arr

Out[3]:

array([[ 0, 1, 2, 3],
        [ 4, 5, 6, 7],
        [ 8, 9, 10, 11],
        [12, 13, 14, 15],
      [16, 17, 18, 19]])

In [4]: axis_ABC = np.array(['A', 'A', 'B', 'C', 'C'])

# assigning scalar values with boolean arrays

In [16]: arr[axis_ABC == 'A'] = 100

In [17]: arr

Out[17]:

array([[100, 100, 100, 100],
        [100, 100, 100, 100],
        [ 8,   9, 10, 11],
        [ 12, 13, 14, 15],
        [ 16, 17, 18, 19]])

In [18]: arr[arr >= 100] = 0

In [19]: arr

Out[19]:

array([[ 0, 0, 0, 0],
        [ 0, 0, 0, 0],
      [ 8, 9, 10, 11],
      [12, 13, 14, 15],
        [16, 17, 18, 19]])

In [20]: arr[(arr >= 8) & (arr <= 15)] = 10

In [21]: arr

Out[21]:

array([[ 0, 0, 0, 0],
        [ 0, 0, 0, 0],
        [10, 10, 10, 10],
        [10, 10, 10, 10],
        [16, 17, 18, 19]])
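
Assignment through a Boolean mask changes the array in place. If the original must stay untouched, np.where is one alternative that returns a new array; the sketch below (with the illustrative name `replaced`) compares the two.

```python
import numpy as np

arr = np.arange(20).reshape(5, 4)
cond = (arr >= 8) & (arr <= 15)

# np.where builds a new array: 10 where cond is True, arr elsewhere.
replaced = np.where(cond, 10, arr)
untouched_before = arr[2].tolist()   # arr itself is still unchanged here

# Mask assignment modifies arr in place.
arr[cond] = 10

print(replaced)
print(np.array_equal(arr, replaced))
```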

The next post covers fancy indexing, which selects rows and columns in a given order.

I hope this was helpful.

Posted by Rfriend

[Python NumPy] Selecting part of an array : indexing and slicing an ndarray

Python Analysis and Programming/Python Data Preprocessing 2017. 2. 25. 23:48

Some time ago, in the series on preprocessing Python Pandas DataFrames, I introduced how to select rows or columns of a DataFrame (indexing and selection of DataFrame objects).

This post shows how to select part of a Python NumPy array, a subset (indexing and slicing an ndarray).

When working with multidimensional arrays, indexing and slicing are used as routinely as chopsticks reach for kimchi during a meal, so it pays to know them precisely.

The examples go through 1D, 2D, and 3D arrays in that order.

(1-1) Indexing a subset of 1D array : a[from : to]

#%% NumPy array Indexing and Slicing

In [1]: import numpy as np

In [2]: a = np.arange(10)

In [3]: a

Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# (1-1) Indexing a subset of 1D array

In [4]: a[0]

Out[4]: 0

In [5]: a[0:5]

Out[5]: array([0, 1, 2, 3, 4])

(1-2) array slices are views of the original array and are not a copy

There is one thing to watch out for with indexing and slicing in Python NumPy. The object obtained by slicing an array is not an independent copy; it is only a view of the original array. So assigning new values through the view also changes the original array.

In the example below, b_idx is made by slicing positions 0 through 4 of array b, and it is only a view of b, not a copy. Any change made through b_idx shows up in b. After setting the elements at positions 0, 1, 2 of b_idx to 10, the elements at positions 0, 1, 2 of the original array b are 10 as well.

In [6]: b = np.arange(10)

In [7]: b

Out[7]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [8]: b_idx = b[0:5]

In [9]: b_idx

Out[9]: array([0, 1, 2, 3, 4])

# Assigning(Broadcasting) a scalar to a slice of 1D array

In [10]: b_idx[0:3] = 10

In [11]: b_idx

Out[11]: array([10, 10, 10, 3, 4])

# compare this 'b' with the original array 'b' above, it's different!!!

In [12]: b

Out[12]: array([10, 10, 10, 3, 4, 5, 6, 7, 8, 9])

R users will find this quite disconcerting. In R, indexing always yields a copy, and the original array and the indexed array are treated as separate, independent objects. With a very large array I used to pull out a small part by indexing, try all sorts of data manipulation on that small array, and then apply the final, verified R script to the large original. Whatever I did to the indexed array had no effect on the original.

In Python NumPy, by contrast, I did not know at first that whatever I did to the indexed array would be reflected in the original. After running various tests on an indexed array I found the original array altered and thought, 'What is this? Why? What happened? Is this a bug?' I was stuck on that for quite a while. -_-;

There is a reason NumPy returns a view rather than a copy when slicing an array: it improves performance and avoids memory problems.
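
One way to tell a view from a copy, sketched here with the illustrative names `b_view` and `b_copy`, is the `.base` attribute: a view's `.base` refers to the array that owns the data, while an array that owns its own data has `.base` set to None.

```python
import numpy as np

b = np.arange(10)
b_view = b[0:5]          # slicing: a view onto b's data
b_copy = b[0:5].copy()   # explicit copy: independent data

view_base_is_b = b_view.base is b
copy_has_no_base = b_copy.base is None

print(view_base_is_b, copy_has_no_base)
```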

(1-3) Copying a sliced array : arr[0:5].copy()

If you want a copy of the sliced array, so that it is independent of the original, use the copy() method. The example below is the same as the one above except for c[0:5].copy(). In the final result [19], the original array 'c' keeps its original values, unaffected by the change made to 'c_idx_copy'. Compare result [12] above with result [19] below.

In [13]: c = np.arange(10)

In [14]: c

Out[14]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [15]: c_idx_copy = c[0:5].copy()

In [16]: c_idx_copy

Out[16]: array([0, 1, 2, 3, 4])

In [17]: c_idx_copy[0:3] = 10

In [18]: c_idx_copy

Out[18]: array([10, 10, 10, 3, 4])

In [19]: c

Out[19]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Now let's move on to 2D arrays.

(2-1) Indexing and Slicing 2D array with comma ',' : d[0:3, 1:3]

Indexing is done by giving positions for rows and columns. For consecutive positions a colon ':' is convenient, as in '0:2'. Rows and columns are separated by a comma ','.

Several patterns are shown below; look at each indexing form and its result.

In [20]: d = np.arange(20).reshape(4, 5)

In [21]: d

Out[21]:

array([[ 0, 1, 2, 3, 4],
        [ 5, 6, 7, 8, 9],
        [10, 11, 12, 13, 14],
      [15, 16, 17, 18, 19]])

# indexing a row of 2D array => returning a 1D array

In [22]: d[0]

Out[22]: array([0, 1, 2, 3, 4])

# indexing multiple consecutive rows of a 2D array : use colon ':'

In [23]: d[0:2]

Out[23]:

array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])

# indexing an element of a 2D array : use comma ','

In [24]: d[0, 4]

Out[24]: 4

In [25]: d[0:3, 1:3]

Out[25]:

array([[ 1, 2],
[ 6, 7],
[11, 12]])

(2-2) Indexing and Slicing 2D array with square bracket '[ ][ ]' : d[0:3][1:3]

When indexing a NumPy array by row and column, you can also use two pairs of square brackets '[ ][ ]' instead of a comma ','. However, the order of indexing and the result differ from the comma form shown above, so take care.

With two pairs of brackets '[ ][ ]', the first '[ ]' is applied first, and then the second '[ ]' indexes the result of the first.

In [26]: d

Out[26]:

array([[ 0, 1, 2, 3, 4],
        [ 5, 6, 7, 8, 9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]])

In [27]: d[0][4] # the same result with the above [24]

Out[27]: 4

In [28]: d[0:3][1:3] # a different result from the above [25], working sequentially

Out[28]:

array([[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
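
The difference can be made explicit by writing the chained form as two steps. This is a sketch with illustrative names (`comma`, `step1`, `chained`): both slices in the chained form act on axis 0, the second one on the intermediate result.

```python
import numpy as np

d = np.arange(20).reshape(4, 5)

comma = d[0:3, 1:3]    # rows 0-2 and columns 1-2 in one step
step1 = d[0:3]         # rows 0-2, all columns
chained = step1[1:3]   # rows 1-2 of that intermediate result

print(comma.shape)     # a 3 x 2 block
print(chained.shape)   # two full rows
```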

(2-3) array slices are views of the original array and are not a copy

In (1-2) above, the array obtained by slicing a NumPy array was a view of the original, not a copy. The same holds for 2D arrays. As a review, let's change part of the view 'd_idx_0' taken from a 2D array. The change made through the view 'd_idx_0' is reflected in the original array 'd'.

# assigning a scalar value to the subset(1D array) of 2D array

In [29]: d

Out[29]:

array([[ 0, 1, 2, 3, 4],
        [ 5, 6, 7, 8, 9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]])

In [30]: d_idx_0 = d[0]

In [31]: d_idx_0

Out[31]: array([0, 1, 2, 3, 4])

In [32]: d_idx_0[0:2] = 100

In [33]: d_idx_0

Out[33]: array([100, 100, 2, 3, 4])

# once again, subset of array by indexing is not a copy, but a view!!!

In [34]: d

Out[34]:

array([[100, 100,   2,   3,   4],
      [ 5,   6,   7,   8,   9],
        [ 10, 11, 12, 13, 14],
        [ 15, 16, 17, 18, 19]])

Let's finish with 3D array indexing.

(3-1) Indexing and Slicing of 3D array : e[0, 0, 0:3]

The method is the same as for 1D and 2D arrays. A 3D array appears as 2D arrays stacked in layers, which can be confusing, so several indexing patterns are shown below for reference.

What is added is selecting the layer first (indexing one of the 2D blocks); the indexing after that is the same as for 2D arrays.

# (3-1) Indexing and Slicing 3D array

In [35]: e = np.arange(24).reshape(2, 3, 4)

In [36]: e

Out[36]:

array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

# indexing the first array with shape(3, 4) from the 3D array with shape(2, 3, 4)

In [37]: e_idx_0 = e[0]

In [38]: e_idx_0

Out[38]:

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

# indexing the first row of the first array with shape(3, 4) from the 3D array with shape(2, 3, 4)

In [39]: e_idx_0_0 = e[0, 0]

In [40]: e_idx_0_0

Out[40]: array([0, 1, 2, 3])

# indexing the '0, 1, 2' elements from the first row of the first array with shape(3, 4)

# from the 3D array with shape(2, 3, 4)

In [41]: e_idx_0_0_0_2 = e[0, 0, 0:3]

In [42]: e_idx_0_0_0_2

Out[42]: array([0, 1, 2])

(3-2) Taking a whole axis ( indexing the entire axis by using colon ':')

With a colon ':' you can take an entire axis. In the example below the colon takes all of axis 1 (every row of the selected layer).

In [44]: e

Out[44]:

array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],

      [[12, 13, 14, 15],
       [16, 17, 18, 19],
        [20, 21, 22, 23]]])

# indexing the entire axis by using colon ':'

In [43]: e[0, :, 0:3]

Out[43]:

array([[ 0, 1, 2],
[ 4, 5, 6],
[ 8, 9, 10]])
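
A compact way to see what each index does, sketched here with an illustrative `shapes` dictionary: an integer index removes its axis, while a slice or ':' keeps it.

```python
import numpy as np

e = np.arange(24).reshape(2, 3, 4)

shapes = {
    "e[0]": e[0].shape,                  # integer on axis 0 drops it
    "e[0, 0]": e[0, 0].shape,            # integers on axes 0 and 1
    "e[0, :, 0:3]": e[0, :, 0:3].shape,  # all of axis 1, part of axis 2
    "e[:, 0, :]": e[:, 0, :].shape,      # all of axes 0 and 2
}

for expr, shape in shapes.items():
    print(expr, shape)
```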

Comparing Python NumPy array indexing with R matrix indexing

Finally, if you use both R and Python, indexing will be confusing at first.

I find R indexing much easier; I don't know about you. Python indexing starting from 0 feels awkward, and so does '0:3' not including position 3. With Python indexing I always have to stay alert and think hard to avoid mistakes, which is tiring. -_-;;;

(If you come from Java or C, you would probably say Python indexing is easier and R indexing is the confusing one.)

Below, R matrix indexing that gives the same result as Python NumPy 2D array indexing is shown for comparison.

Python NumPy (indexing of a 2D array) comes first, followed by R (indexing of a matrix).

In [44]: d = np.arange(20).reshape(4, 5)

In [45]: d

Out[45]:

array([[ 0, 1, 2, 3, 4],
        [ 5, 6, 7, 8, 9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]])

In [46]: d[0:3, 1:3]

Out[46]:

array([[ 1, 2],
[ 6, 7],
[11, 12]])

> R_matrix_4_5 <- matrix(0:19, 
+                        nrow=4, 
+                        byrow=T)
>
> R_matrix_4_5
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    1    2    3    4
[2,]    5    6    7    8    9
[3,]   10   11   12   13   14
[4,]   15   16   17   18   19
> 
> R_matrix_4_5[1:3, 2:3]
     [,1] [,2]
[1,]    1    2
[2,]    6    7
[3,]   11   12

The next posts cover

- Boolean Indexing

- Fancy Indexing

I hope this was helpful.

Posted by Rfriend

[Python NumPy] Swapping the rows and columns of a matrix, swapping axes, transpose : a.T, np.transpose(a), np.swapaxes(a,0,1)

Python Analysis and Programming/Python Data Preprocessing 2017. 2. 25. 19:23

The previous post introduced arr[:, np.newaxis, :] and np.tile() as ways to add an axis to a multidimensional ndarray.

This post shows how to swap the rows and columns of a matrix and how to swap its axes. In linear algebra this is the transpose of a matrix. The transpose (aT) is used, for example, when computing the product aT*a of the transpose and the original matrix.

NumPy offers three ways to transpose:

- the a.T attribute

- the np.transpose(a) function

- the np.swapaxes(a, 0, 1) function

Simple 2D examples of the a.T attribute, np.transpose(), and np.swapaxes() follow.

(1-1) Transposing 2D array : a.T attribute

In [1]: import numpy as np

In [2]: a = np.arange(15).reshape(3, 5)

In [3]: a

Out[3]:

array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])

# (1-1) transposing 2D array : T attribute

In [4]: a.T

Out[4]:

array([[ 0, 5, 10],
        [ 1, 6, 11],
      [ 2, 7, 12],
        [ 3, 8, 13],
       [ 4, 9, 14]])

(1-2) Transposing 2D array : np.transpose() method

In [5]: a

Out[5]:

array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])

# (1-2) transpose method : numpy.transpose(a, axes=None)

In [6]: np.transpose(a)

Out[6]:

array([[ 0, 5, 10],
        [ 1, 6, 11],
        [ 2, 7, 12],
        [ 3, 8, 13],
      [ 4, 9, 14]])

(1-3) Transposing 2D array : np.swapaxes() method

In [7]: a

Out[7]:

array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])

# (1-3) swapaxes method : numpy.swapaxes(a, axis1, axis2)

In [8]: np.swapaxes(a, 0, 1)

Out[8]:

array([[ 0, 5, 10],
        [ 1, 6, 11],
      [ 2, 7, 12],
        [ 3, 8, 13],
        [ 4, 9, 14]])

At this point, shall we compute the matrix product aT*a with NumPy's np.dot()?

In [9]: np.dot(a.T, a)

Out[9]:

array([[125, 140, 155, 170, 185],
        [140, 158, 176, 194, 212],
      [155, 176, 197, 218, 239],
        [170, 194, 218, 242, 266],
        [185, 212, 239, 266, 293]])

In [10]: np.dot(np.transpose(a), a)

Out[10]:

array([[125, 140, 155, 170, 185],
        [140, 158, 176, 194, 212],
      [155, 176, 197, 218, 239],
        [170, 194, 218, 242, 266],
        [185, 212, 239, 266, 293]])

In [11]: np.dot(np.swapaxes(a, 0, 1), a)

Out[11]:

array([[125, 140, 155, 170, 185],
        [140, 158, 176, 194, 212],
      [155, 176, 197, 218, 239],
        [170, 194, 218, 242, 266],
        [185, 212, 239, 266, 293]])
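
As a side sketch (not in the original post), the `@` operator gives the same matrix product as np.dot for 2D arrays, and aT*a is always symmetric. The name `gram` is illustrative.

```python
import numpy as np

a = np.arange(15).reshape(3, 5)

gram = a.T @ a   # same as np.dot(a.T, a) for 2D arrays

print(gram.shape)
print(np.array_equal(gram, np.dot(a.T, a)))
print(np.array_equal(gram, gram.T))   # symmetric
print(np.shares_memory(a, a.T))       # a.T is a view, not a copy
```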

Having had a taste of transposing with a 2D matrix, let's raise the difficulty and swap the axes of a 3D array. np.transpose() and np.swapaxes() take the axes to be permuted as arguments, so they need a little care.

(2-1) Transposing 3D array : a.T attribute

In [12]: b = np.arange(24).reshape(2, 3, 4)

In [13]: b

Out[13]:

array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

In [14]: b.T

Out[14]:

array([[[ 0, 12],
[ 4, 16],
[ 8, 20]],

        [[ 1, 13],
         [ 5, 17],
         [ 9, 21]],

      [[ 2, 14],
       [ 6, 18],
         [10, 22]],

        [[ 3, 15],
         [ 7, 19],
         [11, 23]]])

In [15]: b.T.shape

Out[15]: (4, 3, 2)

(2-2) Transposing 3D array : np.transpose() method

np.transpose() lets you specify freely which axes go where and in what order, so it is more flexible than the T attribute. (Though it can be confusing and hard to grasp at first.)

# shape (2, 3, 4)

In [16]: b

Out[16]:

array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],

      [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

# shape(2, 3, 4) => shape (4, 3, 2)

In [17]: np.transpose(b)

Out[17]:

array([[[ 0, 12],
[ 4, 16],
[ 8, 20]],

      [[ 1, 13],
         [ 5, 17],
         [ 9, 21]],

        [[ 2, 14],
       [ 6, 18],
         [10, 22]],

        [[ 3, 15],
         [ 7, 19],
         [11, 23]]])

# shape(2, 3, 4) => shape (4, 3, 2)

In [18]: np.transpose(b, (2, 1, 0))

Out[18]:

array([[[ 0, 12],
[ 4, 16],
[ 8, 20]],

        [[ 1, 13],
         [ 5, 17],
         [ 9, 21]],

      [[ 2, 14],
       [ 6, 18],
        [10, 22]],

        [[ 3, 15],
         [ 7, 19],
       [11, 23]]])

In [19]: b.shape

Out[19]: (2, 3, 4)

In [20]: np.transpose(b).shape

Out[20]: (4, 3, 2)

(2-3) Transposing 3D array : np.swapaxes() method

In [21]: b

Out[21]:

array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

In [22]: b.shape

Out[22]: (2, 3, 4)

# shape(2, 3, 4) => shape(4, 3, 2)

In [23]: np.swapaxes(b, 0, 2)

Out[23]:

array([[[ 0, 12],
[ 4, 16],
[ 8, 20]],

        [[ 1, 13],
         [ 5, 17],
       [ 9, 21]],

        [[ 2, 14],
       [ 6, 18],
         [10, 22]],

        [[ 3, 15],
         [ 7, 19],
         [11, 23]]])

In [24]: np.swapaxes(b, 0, 2).shape

Out[24]: (4, 3, 2)

Now let's use np.transpose() and np.swapaxes() with a different axis order from the examples above. Just pass the axes in a different order. (It is hard to explain in words, so compare the example below closely with the one above.)

In [25]: b

Out[25]:

array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

In [26]: b.shape

Out[26]: (2, 3, 4)

# shape(2, 3, 4) => shape(3, 2, 4)

In [27]: np.transpose(b, (1, 0, 2)).shape

Out[27]: (3, 2, 4)

In [28]: np.transpose(b, (1, 0, 2))

Out[28]:

array([[[ 0,  1,  2,  3],
        [12, 13, 14, 15]],

       [[ 4,  5,  6,  7],
        [16, 17, 18, 19]],

       [[ 8,  9, 10, 11],
        [20, 21, 22, 23]]])

The same result as [28], np.transpose(b, (1, 0, 2)), can be obtained with np.swapaxes(b, 1, 0).

In [29]: b

Out[29]:

array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

In [30]: b.shape

Out[30]: (2, 3, 4)

In [31]: np.swapaxes(b, 1, 0).shape

Out[31]: (3, 2, 4)

In [32]: np.swapaxes(b, 1, 0)

Out[32]:

array([[[ 0,  1,  2,  3],
        [12, 13, 14, 15]],

       [[ 4,  5,  6,  7],
        [16, 17, 18, 19]],

       [[ 8,  9, 10, 11],
        [20, 21, 22, 23]]])
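
The axes argument can be read as a rule: axis i of the result is axis axes[i] of the input. The sketch below (with illustrative names `axes`, `t`, `expected_shape`) checks that rule for the shape and for one element, and confirms the equivalence with np.swapaxes.

```python
import numpy as np

b = np.arange(24).reshape(2, 3, 4)
axes = (1, 0, 2)

t = np.transpose(b, axes)

# Result axis i takes its length from input axis axes[i].
expected_shape = tuple(b.shape[ax] for ax in axes)

print(t.shape, expected_shape)
# With axes (1, 0, 2): t[i, j, k] == b[j, i, k]
print(t[2, 1, 3] == b[1, 2, 3])
print(np.array_equal(t, np.swapaxes(b, 1, 0)))
```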

I hope this was helpful.

Posted by Rfriend

[Python NumPy] Adding an axis to a NumPy array : np.newaxis, np.tile

Python Analysis and Programming/Python Data Preprocessing 2017. 2. 19. 23:10

The previous post covered broadcasting. This post introduces two ways, related to broadcasting, of adding a new axis to a NumPy array.

One is np.newaxis and the other is the np.tile() function. Their usage and results differ somewhat, so pick the one that suits your data-processing purpose.

(1) Adding a new axis of length 1 by indexing

: arr[:, np.newaxis, :]

(2) Adding a new axis while repeating the array

: np.tile(arr, reps)

(1) Adding a new axis of length 1 by indexing : arr[:, np.newaxis, :]

First, import NumPy and create an example array.

In [1]: import numpy as np

In [2]: a = np.array([1., 2., 3., 4.])

In [3]: a

Out[3]: array([ 1., 2., 3., 4.])

In [4]: a.shape

Out[4]: (4,)

Let's add one new axis of length 1 with np.newaxis.

# (1) np.newaxis : adding and inserting new axis with length 1 by indexing

In [5]: a_4_1 = a[:, np.newaxis]

In [6]: a_4_1

Out[6]:

array([[ 1.],
       [ 2.],
       [ 3.],
       [ 4.]])

In [7]: a_4_1.shape

Out[7]: (4, 1)

This time, let's add a new axis of length 1 to a 2D array of shape (3, 4) to make a 3D array. ':' keeps an existing axis as it is, and np.newaxis inserts the new axis at its position.

In [8]: b = np.arange(12).reshape(3, 4)

In [9]: b

Out[9]:

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

In [10]: b.shape

Out[10]: (3, 4)

# shape(3, 4, 1)

In [11]: b_3_4_1 = b[ :, :, np.newaxis]

In [12]: b_3_4_1

Out[12]:

array([[[ 0],
         [ 1],
         [ 2],
         [ 3]],

        [[ 4],
         [ 5],
       [ 6],
        [ 7]],

        [[ 8],
         [ 9],
       [10],
        [11]]])

In [13]: b_3_4_1.shape

Out[13]: (3, 4, 1)

# shape(3, 1, 4)

In [14]: b_3_1_4 = b[ :, np.newaxis, : ]

In [15]: b_3_1_4

Out[15]:

array([[[ 0, 1, 2, 3]],

[[ 4, 5, 6, 7]],

[[ 8, 9, 10, 11]]])

In [16]: b_3_1_4.shape

Out[16]: (3, 1, 4)
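
For reference, np.newaxis is an alias for None, and np.expand_dims inserts a length-1 axis without indexing syntax. The sketch below (illustrative variable names) shows that all three spellings give the same result.

```python
import numpy as np

b = np.arange(12).reshape(3, 4)

with_newaxis = b[:, np.newaxis, :]
with_none = b[:, None, :]                 # np.newaxis is None
with_expand = np.expand_dims(b, axis=1)   # function form

print(np.newaxis is None)
print(with_newaxis.shape, with_none.shape, with_expand.shape)
```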

(2) Adding a new axis while repeating the array : np.tile(arr, reps)

np.tile(arr, reps) takes the array as 'arr' and the number of repetitions as 'reps'.

'reps' can be a single number or a sequence such as a tuple.

First, with a 4-element array, let's in turn

- repeat it twice within the same dimension : reps = 2

- repeat it twice along each of two dimensions : reps = (2, 2)

In [17]: A = np.array([0., 1., 2., 3.])

In [18]: A

Out[18]: array([ 0., 1., 2., 3.])

In [19]: A.shape

Out[19]: (4,)

# reps = 2

In [20]: A_8 = np.tile(A, 2)

In [21]: A_8

Out[21]: array([ 0., 1., 2., 3., 0., 1., 2., 3.])

In [22]: A_8.shape

Out[22]: (8,)

# reps = (2, 2)

In [23]: A_2_8 = np.tile(A, (2, 2))

In [24]: A_2_8

Out[24]:

array([[ 0., 1., 2., 3., 0., 1., 2., 3.],
[ 0., 1., 2., 3., 0., 1., 2., 3.]])

In [25]: A_2_8.shape

Out[25]: (2, 8)

Next, with a 2D array, let's use np.tile() to

- repeat the array

- repeat the array while adding one more dimension

In [26]: B = np.arange(8).reshape(2, 4)

In [27]: B

Out[27]:

array([[0, 1, 2, 3],
[4, 5, 6, 7]])

In [28]: B.shape

Out[28]: (2, 4)

In [29]: B_2_8 = np.tile(B, 2)

In [30]: B_2_8

Out[30]:

array([[0, 1, 2, 3, 0, 1, 2, 3],
[4, 5, 6, 7, 4, 5, 6, 7]])

In [31]: B_2_8.shape

Out[31]: (2, 8)

In [32]: B_4_4 = np.tile(B, (2, 1))

In [33]: B_4_4

Out[33]:

array([[0, 1, 2, 3],
        [4, 5, 6, 7],
        [0, 1, 2, 3],
       [4, 5, 6, 7]])

In [34]: B_4_4.shape

Out[34]: (4, 4)

In [35]: B_3_4_8 = np.tile(B, (3, 2, 2))

In [36]: B_3_4_8

Out[36]:

array([[[0, 1, 2, 3, 0, 1, 2, 3],
         [4, 5, 6, 7, 4, 5, 6, 7],
       [0, 1, 2, 3, 0, 1, 2, 3],
         [4, 5, 6, 7, 4, 5, 6, 7]],

        [[0, 1, 2, 3, 0, 1, 2, 3],
         [4, 5, 6, 7, 4, 5, 6, 7],
         [0, 1, 2, 3, 0, 1, 2, 3],
       [4, 5, 6, 7, 4, 5, 6, 7]],

        [[0, 1, 2, 3, 0, 1, 2, 3],
         [4, 5, 6, 7, 4, 5, 6, 7],
       [0, 1, 2, 3, 0, 1, 2, 3],
        [4, 5, 6, 7, 4, 5, 6, 7]]])

In [37]: B_3_4_8.shape

Out[37]: (3, 4, 8)
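
The resulting shape can be predicted by hand. When reps is longer than the array's number of dimensions, the array's shape is padded on the left with 1s, and each axis length is then multiplied by the matching entry of reps. The sketch below checks this with illustrative names (`padded_shape`, `expected`).

```python
import numpy as np

B = np.arange(8).reshape(2, 4)
reps = (3, 2, 2)

tiled = np.tile(B, reps)

# Pad B's shape on the left with 1s: (2, 4) -> (1, 2, 4)
padded_shape = (1,) * (len(reps) - B.ndim) + B.shape
# Multiply axis by axis: (1*3, 2*2, 4*2)
expected = tuple(s * r for s, r in zip(padded_shape, reps))

print(tiled.shape, expected)
```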

I hope this was helpful.

Posted by Rfriend

[Python NumPy] Broadcasting in arithmetic between arrays of different dimensions

Python Analysis and Programming/Python Data Preprocessing 2017. 2. 12. 23:16

The previous post introduced vectorization, arithmetic between arrays of the same shape.

This post looks at broadcasting, arithmetic between arrays whose shapes differ. (Broadcasting is part of vectorization. In fact, the element-wise operations with a scalar in the previous vectorization post were already a taste of broadcasting.)

Because broadcasting operates on arrays of different shapes and sizes it can be confusing, but once understood it is very convenient and fast. To make it easier to follow, each broadcast is also illustrated with an image (dotted lines and arrows).

Four types of broadcasting are introduced in turn, by array dimension and axis.

1) Broadcasting over axis 1 with a Scalar

2) Broadcasting over axis 0 with a 1-D array

3) Broadcasting over axis 1 with a 2-D array

4) Broadcasting over axis 0 with a 3-D array

Let's go through them in order.

1) Broadcasting over axis 1 with a Scalar

Let's start with a simple scalar.

# (1-1) Arithmetic operations between array and scalars
# : the scalar is broadcast along the same dimensions of ndarray

In [1]: import numpy as np

In [2]: a_ar = np.array([1., 2., 3., 4.])

In [3]: a_ar.shape

Out[3]: (4,)

In [4]: a_ar + 1

Out[4]: array([ 2., 3., 4., 5.])

Broadcasting applies not only to arrays but also to scalar arithmetic on a Pandas DataFrame. Here is a simple example.

# (1-2) Arithmetic operations between DataFrame and scalars
# : the scalar is broadcast along the same dimensions of DataFrame

In [5]: import pandas as pd

In [6]: a_df = pd.DataFrame({'x1': [1, 2, 3, 4], 'x2': [5, 6, 7, 8]})

In [7]: a_df

Out[7]:

   x1  x2
0   1   5
1   2   6
2   3   7
3   4   8

In [8]: a_df + 1

Out[8]:

   x1  x2
0   2   6
1   3   7
2   4   8
3   5   9

Now, shall we add one more dimension?

2) Broadcasting over axis 0 with a 1-D array

This example broadcasts by copying the row downward (over axis 0).

## (2) Broadcasting using a 1-D array
# Arithmetic operations between 2-D array and 1-D array
# that is the same length as the row-length

In [9]: b = np.arange(12).reshape((4, 3))

In [10]: b.shape

Out[10]: (4, 3)

In [11]: b

Out[11]:

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [12]: c = np.array([0, 1, 2])

In [13]: c.shape

Out[13]: (3,)

In [14]: c

Out[14]: array([0, 1, 2])

# adding c (1-D array) row-wise to b (2-D array)

In [15]: b + c

Out[15]:

array([[ 0,  2,  4],
       [ 3,  5,  7],
       [ 6,  8, 10],
       [ 9, 11, 13]])

Broadcasting does not work for just any pair of differing shapes. The shapes are compared from the trailing (last) axis backward, and each pair of sizes must either be equal or include a 1; only then can the smaller array be stretched (broadcast, propagated) to match. Below is a case where broadcasting fails with a ValueError.

ValueError: operands could not be broadcast together with shapes (4,3) (4,)

## Shape mismatches
# ValueError: operands could not be broadcast together with shapes (4,3) (4,)

In [11]: b

Out[11]:

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [16]: d = np.array([0, 1, 2, 3])

In [17]: b + d

Traceback (most recent call last):

File "<ipython-input-17-8c4237e65878>", line 1, in <module>

b + d

ValueError: operands could not be broadcast together with shapes (4,3) (4,)
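
The compatibility rule can be written out as a few lines of plain Python. `broadcast_compatible` below is a hypothetical helper written for illustration, not a NumPy function: it walks both shapes from the last axis backward and accepts a pair of sizes only if they are equal or one of them is 1.

```python
from itertools import zip_longest

import numpy as np


def broadcast_compatible(shape_a, shape_b):
    """Hypothetical helper: True if the two shapes can be broadcast."""
    # Compare sizes from the trailing axis; a missing axis counts as 1.
    for x, y in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            return False
    return True


b = np.arange(12).reshape(4, 3)
d = np.array([0, 1, 2, 3])

print(broadcast_compatible(b.shape, d.shape))   # (4, 3) vs (4,)
print(broadcast_compatible(b.shape, (3,)))      # (4, 3) vs (3,)
print(broadcast_compatible(b.shape, (4, 1)))    # (4, 3) vs (4, 1)

# Giving d a trailing axis of length 1 makes the shapes compatible.
fixed = b + d[:, np.newaxis]
print(fixed)
```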

3) Broadcasting over axis 1 with a 2-D array

This example broadcasts by copying the column sideways (over axis 1).

## (3) Broadcasting over axis 1 of a 2-D array

In [18]: b = np.arange(12).reshape((4, 3))

In [19]: b.shape

Out[19]: (4, 3)

In [20]: b

Out[20]:

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [21]: e = np.array([0, 1, 2, 3]).reshape(4, 1)

In [22]: e.shape

Out[22]: (4, 1)

In [23]: e

Out[23]:

array([[0],
       [1],
       [2],
       [3]])

# adding e (2-D array) column-wise to b (2-D array)

In [24]: b + e

Out[24]:

array([[ 0,  1,  2],
       [ 4,  5,  6],
       [ 8,  9, 10],
       [12, 13, 14]])

Now let's move on to 3D. Is your head starting to hurt? ^^;

From 4D on it is awkward to illustrate with pictures, so I will stop at 3D.

4) Broadcasting over axis 0 with a 3-D array

This example broadcasts by copying a 2D array front to back (over axis 0) of a 3D array.

## (4) Broadcasting over axis 0 of a 3-D array
# 3-D array

In [25]: f = np.arange(24).reshape((2,4,3))

In [26]: f

Out[26]:

array([[[ 0, 1, 2],

[ 3, 4, 5],

[ 6, 7, 8],

[ 9, 10, 11]],

[[12, 13, 14],

[15, 16, 17],

[18, 19, 20],

[21, 22, 23]]])

# 2-D array

In [27]: g = np.ones((4,3))

In [28]: g

Out[28]:

array([[ 1., 1., 1.],

[ 1., 1., 1.],

[ 1., 1., 1.],

[ 1., 1., 1.]])

# Broadcasting over axis 0 of a 3-D array : 3-D array + 2-D array

In [29]: f + g

Out[29]:

array([[[ 1., 2., 3.],

[ 4., 5., 6.],

[ 7., 8., 9.],

[ 10., 11., 12.]],

[[ 13., 14., 15.],

[ 16., 17., 18.],

[ 19., 20., 21.],

[ 22., 23., 24.]]])
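To make it concrete what "copying over axis 0" means, here is a small sketch that rebuilds the same sum by hand, one (4, 3) block at a time, and compares it with the broadcast result. The variable names broadcast_sum and manual_sum are illustrative only.

```python
import numpy as np

f = np.arange(24).reshape((2, 4, 3))   # 3-D array
g = np.ones((4, 3))                    # 2-D array

# Broadcasting: g is conceptually reused for every block along axis 0
broadcast_sum = f + g

# The same result built explicitly: add g to each (4, 3) block, then restack
manual_sum = np.stack([f[i] + g for i in range(f.shape[0])])

print(broadcast_sum.shape)
print(np.array_equal(broadcast_sum, manual_sum))
```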

* For how to match dimensions with the repeat(n, axis) method when doing arithmetic between two arrays of different dimensions, see https://rfriend.tistory.com/549.

I hope this was helpful.


Posted by Rfriend


[Python NumPy] Operations between Arrays, and between Arrays and Scalars (Numerical Operations between Arrays and Scalars)

Python Analysis and Programming / Python Data Preprocessing, 2017. 2. 4. 23:49

When you study machine learning or deep learning with Python, NumPy operations are assumed as the foundation. If you know NumPy, the material is easy to follow; if you don't, it will look like a cipher.

In this post we look at operations between arrays, and between arrays and scalars, in Python NumPy. It is not particularly hard.

For arithmetic operations on two numeric arrays of the same size, NumPy ndarrays let you run the whole batch very quickly through vectorization, without writing 'for loops'.

If the two arrays have different dimensions (dimension, shape), NumPy applies broadcasting to carry out the operation; I will cover that separately in the next post.

[ Operations between NumPy Arrays and Scalars ]

First, import the NumPy module and create the example arrays.

#%% Numerical Operations between Arrays and Scalars
# vectorization

## importing modules

In [1]: import numpy as np

## making an nparray

In [2]: x = np.array([1., 1., 2., 2.])

In [3]: y = np.array([1., 2., 3., 4.])

In [4]: x

Out[4]: array([ 1., 1., 2., 2.])

In [5]: y

Out[5]: array([ 1., 2., 3., 4.])

(1) Arithmetic operations with an array and a scalar

Arithmetic operators : addition (+), subtraction (-), multiplication (*), division (/), floor division (//, the quotient), modulus (%, the remainder), exponentiation (**)

The scalar is applied to every element of the array in turn (propagating the value to each element).

## (1) Arithmetic operations with scalars, propagating the value to each element
# (1-1) addition

In [6]: y + 1

Out[6]: array([ 2., 3., 4., 5.])

# (1-2) subtraction

In [7]: y - 1

Out[7]: array([ 0., 1., 2., 3.])

# (1-3) multiplication

In [8]: y*2

Out[8]: array([ 2., 4., 6., 8.])

# (1-4) division

In [9]: y/2

Out[9]: array([ 0.5, 1. , 1.5, 2. ])

# (1-5) floor division

In [10]: y//2

Out[10]: array([ 0., 1., 1., 2.])

# (1-6) modulus : returns remainder after division

In [11]: y%2

Out[11]: array([ 1., 0., 1., 0.])

# (1-7) exponent

In [12]: y**2

Out[12]: array([ 1., 4., 9., 16.])

In [13]: 2**y

Out[13]: array([ 2., 4., 8., 16.])

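Speed comparison : NumPy vs. pure Python

I said above that NumPy (vectorization) is much faster than pure Python (for loops), so I ran a quick benchmark. For the same operation (adding the scalar 1 to an array of one million elements), NumPy came out about 62.4 times faster than pure Python in this run. This matters a lot. With the short, simple examples you usually practice on while learning Python, you may not notice the speed difference between NumPy and a pure-Python 'for loop'. But on big data with millions or tens of millions of rows, a carelessly written for loop can run for hours without finishing, while the same work done well in NumPy can finish within minutes.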

## comparison process time between NumPy and Pure Python
# NumPy

In [14]: a = np.arange(1000000)

In [15]: a

Out[15]: array([ 0, 1, 2, ..., 999997, 999998, 999999])

In [16]: %timeit a + 1

1000 loops, best of 3: 1.45 ms per loop

# Pure Python

In [17]: b = range(1000000)

In [18]: b

Out[18]: range(0, 1000000)

In [19]: %timeit [i+1 for i in b]

10 loops, best of 3: 90.5 ms per loop

# how much faster is NumPy than pure Python?

In [20]: 90.5/1.45

Out[20]: 62.41379310344828
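The %timeit magic only works inside IPython. As a rough sketch, the same comparison can be run from a plain script with the standard timeit module; the ratio you get depends on your machine and on the NumPy and Python versions, so do not expect to reproduce 62.4 exactly.

```python
import timeit
import numpy as np

n = 1_000_000
a = np.arange(n)   # NumPy array
b = range(n)       # pure-Python range

loops = 5
# Take the best of 3 repeats, as %timeit does, and convert to time per loop
numpy_time = min(timeit.repeat(lambda: a + 1, number=loops, repeat=3)) / loops
python_time = min(timeit.repeat(lambda: [i + 1 for i in b], number=loops, repeat=3)) / loops

print(f"NumPy       : {numpy_time * 1e3:.2f} ms per loop")
print(f"pure Python : {python_time * 1e3:.2f} ms per loop")
print(f"ratio       : {python_time / numpy_time:.1f}x")
```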

(2) Arithmetic operations between equal-size arrays

(Arithmetic elementwise operations between equal-size arrays)

Note that x*y in NumPy is not the matrix product of linear algebra; it is simply the product of the elements at the same positions (element-wise multiplication).

## (2) Arithmetic elementwise operations between equal-size arrays

In [21]: x

Out[21]: array([ 1., 1., 2., 2.])

In [22]: y

Out[22]: array([ 1., 2., 3., 4.])

# (2-1) addition

In [23]: x + y

Out[23]: array([ 2., 3., 5., 6.])

# (2-2) subtraction

In [24]: x - y

Out[24]: array([ 0., -1., -1., -2.])

# (2-3) Array multiplication is not matrix multiplication

In [25]: x*y

Out[25]: array([ 1., 2., 6., 8.])

# (2-4) division

In [26]: y/x

Out[26]: array([ 1. , 2. , 1.5, 2. ])

# (2-5) floor division : quotient

In [27]: y//x

Out[27]: array([ 1., 2., 1., 2.])

# (2-6) modulus : returns remainder after division

In [28]: y%x

Out[28]: array([ 0., 0., 1., 0.])

# (2-7) power

In [29]: y**x

Out[29]: array([ 1., 2., 9., 16.])
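To see the difference between element-wise multiplication and the matrix product side by side, here is a short sketch using the @ operator (Python 3.5+). The small matrices A and B are made up for illustration.

```python
import numpy as np

x = np.array([1., 1., 2., 2.])
y = np.array([1., 2., 3., 4.])

elementwise = x * y   # array([1., 2., 6., 8.]), same shape as x and y
inner = x @ y         # 1*1 + 1*2 + 2*3 + 2*4 = 17.0, a single number

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[0., 1.],
              [1., 0.]])

print(A * B)   # element-wise: [[0., 2.], [3., 0.]]
print(A @ B)   # matrix product: [[2., 1.], [4., 3.]]
```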

(3) Comparison operations between equal-size arrays

Element-wise comparison : np.equal(x, y), np.not_equal(x, y), np.greater(x, y), np.greater_equal(x, y), np.less(x, y), np.less_equal(x, y)

Each returns True where the comparison holds for that element and False where it does not.
Array-wise comparison : np.array_equal(x, y)

## (3) Comparison operations : returns boolean array

In [30]: x

Out[30]: array([ 1., 1., 2., 2.])

In [31]: y

Out[31]: array([ 1., 2., 3., 4.])

# element-wise comparisons
# (3-1) np.equal(x, y) : x==y

In [32]: np.equal(x, y)

Out[32]: array([ True, False, False, False], dtype=bool)

# (3-2) np.not_equal(x, y) : x != y

In [33]: np.not_equal(x, y)

Out[33]: array([False, True, True, True], dtype=bool)

# (3-3) np.greater(x, y) : x > y

In [34]: np.greater(x, y)

Out[34]: array([False, False, False, False], dtype=bool)

# (3-4) np.greater_equal(x, y) : x >= y

In [35]: np.greater_equal(x, y)

Out[35]: array([ True, False, False, False], dtype=bool)

# (3-5) np.less(x, y) : x < y

In [36]: np.less(x, y)

Out[36]: array([False, True, True, True], dtype=bool)

# (3-6) np.less_equal(x, y) : x <= y

In [37]: np.less_equal(x, y)

Out[37]: array([ True, True, True, True], dtype=bool)

# (3-7) array-wise comparisons

In [38]: np.array_equal(x, y)

Out[38]: False

In [39]: z = y.copy()

In [40]: z

Out[40]: array([ 1., 2., 3., 4.])

In [41]: np.array_equal(y, z)

Out[41]: True
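The comparison functions above have operator forms (==, !=, >, and so on) that return the same boolean arrays. One caveat with floats: exact equality can fail because of rounding, in which case np.allclose is usually the safer array-wise check. A small sketch, with p and q as made-up values:

```python
import numpy as np

x = np.array([1., 1., 2., 2.])
y = np.array([1., 2., 3., 4.])

# Operator form and function form give the same element-wise result
same_result = np.array_equal(x == y, np.equal(x, y))
print(same_result)

# Floating-point rounding: 0.1 + 0.2 is not exactly 0.3
p = np.array([0.1 + 0.2, 1.0])
q = np.array([0.3, 1.0])

print(np.array_equal(p, q))   # exact comparison
print(np.allclose(p, q))      # comparison within a small tolerance
```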

(4) Assignment operations between equal-size arrays

Assignment operators : Add AND (m += n), Subtract AND (m -= n), Multiply AND (m *= n), Divide AND (m /= n), Floor Division (m //= n), Modulus AND (m %= n), Exponent AND (m **= n)

##----------------
## (4) Assignment operations

In [42]: m = np.array([1., 1., 2., 2.])

In [43]: n = np.array([1., 2., 3., 4.])

# (4-1) Add AND : +=

In [44]: m += n # equivalent to m = m + n

In [45]: m

Out[45]: array([ 2., 3., 5., 6.])

# m += n is equivalent to m = m + n

In [46]: m = np.array([1., 1., 2., 2.])

In [47]: m = m + n

In [48]: m

Out[48]: array([ 2., 3., 5., 6.])

# (4-2) Subtract AND : -=

In [49]: m = np.array([1., 1., 2., 2.])

In [50]: m -= n # equivalent to m = m - n

In [51]: m

Out[51]: array([ 0., -1., -1., -2.])

# (4-3) Multiply AND : *=

In [52]: m = np.array([1., 1., 2., 2.])

In [53]: m *= n # equivalent to m = m*n

In [54]: m

Out[54]: array([ 1., 2., 6., 8.])

# (4-4) Divide AND : /=

In [55]: m = np.array([1., 1., 2., 2.])

In [56]: m /= n # equivalent to m = m/n

In [57]: m

Out[57]: array([ 1. , 0.5 , 0.66666667, 0.5 ])

# (4-5) Floor Division : //=

In [58]: m = np.array([1., 1., 2., 2.])

In [59]: m //= n # equivalent to m = m//n

In [60]: m

Out[60]: array([ 1., 0., 0., 0.])

# (4-6) Modulus AND : %=

In [61]: m = np.array([1., 1., 2., 2.])

In [62]: m %= n # equivalent to m = m % n

In [63]: m

Out[63]: array([ 0., 1., 2., 2.])

# (4-7) Exponent AND : **=

In [64]: m = np.array([1., 1., 2., 2.])

In [65]: m **= n # equivalent to m = m**n

In [66]: m

Out[66]: array([ 1., 1., 8., 16.])

(5) Logical operations between equal-size arrays

Logical operations
- np.logical_and(a, b) : returns True where both elements are non-zero
- np.logical_or(a, b) : returns True where at least one of the two elements is non-zero
- np.logical_xor(a, b) : returns True where exactly one of the two elements is non-zero (their truth values differ)

##----------------
## (5) Logical operators

In [67]: a = np.array([1, 1, 0, 0], dtype=bool)

In [68]: b = np.array([1, 0, 1, 0], dtype=bool)

# (5-1) np.logical_and : (a and b) is true
# if both operands are non-zero then the condition becomes true

# equivalent to infix operators : &

In [69]: np.logical_and(a, b)

Out[69]: array([ True, False, False, False], dtype=bool)

# (5-2) np.logical_or : (a or b) is true
# if any of the operands is non-zero then the condition becomes true

# equivalent to infix operators : |

In [70]: np.logical_or(a, b)

Out[70]: array([ True, True, True, False], dtype=bool)

# (5-3) np.logical_xor : (a xor b) is true
# if exactly one of the two operands is non-zero then the condition becomes true

# equivalent to infix operators : ^

In [71]: np.logical_xor(a, b)

Out[71]: array([False, True, True, False], dtype=bool)
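The infix operators &, | and ^ match the np.logical_* functions only when both arrays are boolean, as they are above. On integer arrays the infix operators work bit by bit, which can give a different answer. A short sketch with made-up integer arrays u and v:

```python
import numpy as np

u = np.array([1, 2, 0])
v = np.array([2, 1, 0])

logical = np.logical_and(u, v)               # truth values: non-zero means True
bitwise = u & v                              # bitwise AND: 1 & 2 == 0, 2 & 1 == 0
via_bool = u.astype(bool) & v.astype(bool)   # convert first, then & is safe

print(logical)    # [ True  True False]
print(bitwise)    # [0 0 0]
print(via_bool)   # [ True  True False]
```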

(6) Membership operators

Membership operators : in, not in
Returns True if the array (or list) contains the given object, and False if it does not.

# (6) Membership operators

In [72]: p = "P"

In [73]: q = np.array(["P", "Q"])

In [74]: r = np.array(["R", "S"])

# (6-1) in

In [75]: p in q

Out[75]: True

In [76]: p in r

Out[76]: False

# (6-2) not in

In [77]: p not in q

Out[77]: False

In [78]: p not in r

Out[78]: True
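The in operator answers one question for one value. When you want to test many values at once and get a boolean array back, np.isin (NumPy 1.13+) is one option. The array candidates below is made up for illustration.

```python
import numpy as np

q = np.array(["P", "Q"])
candidates = np.array(["P", "R", "Q", "S"])

single = bool("P" in q)            # one value, one answer
mask = np.isin(candidates, q)      # one answer per element of candidates

print(single)                      # True
print(mask)                        # [ True False  True False]
print(candidates[mask])            # elements of candidates that appear in q
```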

When array dimensions or sizes differ

=> ValueError: operands could not be broadcast together with shapes (4,) (5,)

##--------------
## Shape mismatches

In [79]: x4 = np.array([1., 1., 2., 2.])

In [80]: x5 = np.arange(5)

# ValueError: operands could not be broadcast together with shapes (4,) (5,)

In [81]: x4 + x5

Traceback (most recent call last):

File "<ipython-input-82-5e88af36d6f1>", line 1, in <module>

x4 + x5

ValueError: operands could not be broadcast together with shapes (4,) (5,)

Because the dimensions and sizes of the arrays differ, a ValueError was raised, with a message saying the operands could not be 'broadcast' (could not be broadcast together...).

In the next post we will look at broadcasting.

I hope this was helpful.


[Python NumPy] Specifying and Converting ndarray Data Types (Data Types for ndarrays)

Python Analysis and Programming / Python Data Preprocessing, 2017. 1. 30. 16:56

In this post I show how to use Python's NumPy module to

- assign a data type

- check a data type

- convert a data type

Being able to handle a wide range of data types freely is one of NumPy's main strengths, and during data preprocessing you will often need to specify or change a data type (numeric to string, string to numeric, and so on), so this is a basic skill worth learning. It is not hard; skim through the content and try the examples yourself.

(1) Kinds of data types (Data Types)

Data types (dtype) fall broadly into numeric and string types. The numeric types come in five kinds: booleans (bool), integers (int), unsigned integers (uint), floating point (float), and complex.

NumPy's notation for each data type is shown below. The number after the type name (8, 16, 32, 64, 128, etc.) is the number of bits (1 byte = 8 bits) needed to store a single value in memory.

The Type code column in the table is a shorthand for the Type and gives exactly the same result.

| Category | Kind | Type | Type code | Example |
|---|---|---|---|---|
| Numeric | Booleans | bool | ? | [True, True, False, False] |
| Numeric | Integers | int8, int16, int32, int64 | i1, i2, i4, i8 | [-2, -1, 0, 1, 2, 3] |
| Numeric | Unsigned (non-negative) integers | uint8, uint16, uint32, uint64 | u1, u2, u4, u8 | [2, 1, 0, 1, 2, 3] |
| Numeric | Floating point | float16, float32, float64 | f2, f4, f8 | [-2.0, -1.3, 0.0, 1.9, 2.2, 3.6] |
| Numeric | Complex (real + imaginary) | complex64, complex128 | c8, c16 | (1 + 2j) |
| Character | String | string_ | S | ['Seoul', 'Busan', 'Incheon'] |

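You probably learned about complex numbers in high school; unless you are a scientist or mathematician, you will rarely use them for analysis work at a company. Note that in math class the imaginary unit is written as 'i', as in (1+2i), whereas Python writes it as 'j', as in (1+2j).

Here are three ways to specify the data type when creating an ndarray in NumPy. I mostly use the first one because, although it is a bit longer, it is the easiest to read and understand.

(2) Specifying a NumPy data type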

(2-1) np.array([xx, xx], dtype=np.Type)

# importing numpy module

In [1]: import numpy as np

# (2-1) making array with data type of float64: dtype=np.float64

In [2]: x_float64 = np.array([1.4, 2.6, 3.0, 4.9, 5.32], dtype=np.float64)

# checking data type : dtype attribute

In [3]: x_float64.dtype

Out[3]: dtype('float64')

In [4]: x_float64

Out[4]: array([ 1.4 , 2.6 , 3. , 4.9 , 5.32])

Use object.dtype to check the data type.

(2-2) np.array([xx, xx], dtype='Type Code')

# (2-2) making array with shorthand type code string for float64 : dtype='f8'

In [5]: x_float64_1 = np.array([1.4, 2.6, 3.0, 4.9, 5.32], dtype='f8')

In [6]: x_float64_1.dtype

Out[6]: dtype('float64')

In [7]: x_float64_1

Out[7]: array([ 1.4 , 2.6 , 3. , 4.9 , 5.32])

(2-3) np.Type([xx, xx])

# (2-3) making array with data type of float64 : np.float64()

In [8]: x_float64_2 = np.float64([1.4, 2.6, 3.0, 4.9, 5.32])

In [9]: x_float64_2.dtype

Out[9]: dtype('float64')

In [10]: x_float64_2

Out[10]: array([ 1.4 , 2.6 , 3. , 4.9 , 5.32])
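As a quick check that the three spellings really produce the same data type, and that the number in the type name is the storage size in bits, here is a small sketch. The names a1, a2, a3 and small are illustrative.

```python
import numpy as np

values = [1.4, 2.6, 3.0, 4.9, 5.32]

a1 = np.array(values, dtype=np.float64)   # (2-1) type object
a2 = np.array(values, dtype='f8')         # (2-2) type code string
a3 = np.float64(values)                   # (2-3) calling the type on a list

print(a1.dtype == a2.dtype == a3.dtype)
print(np.dtype('f8') == np.dtype(np.float64))

# itemsize is in bytes: float64 uses 8 bytes (64 bits), float32 uses 4
small = np.array(values, dtype=np.float32)
print(a1.itemsize, small.itemsize)
```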

Next, let's look at how to convert data types.

(3) Converting data types : object.astype(np.Type)

(3-1) Converting float64 to int64 : x_float64.astype(np.int64)

After conversion to int64, the fractional part of each float64 value is cut off (truncated).

Note that astype creates a new array (a copy).

In [11]: x_float64 # original data

Out[11]: array([ 1.4 , 2.6 , 3. , 4.9 , 5.32])

# object.astype(np.int64)

In [12]: x_int64 = x_float64.astype(np.int64) # the decimal parts are truncated

In [13]: x_int64.dtype

Out[13]: dtype('int64')

In [14]: x_int64

Out[14]: array([1, 2, 3, 4, 5], dtype=int64)

(3-2) Converting float64 to int64 : np.int64(x_float64)

np.int64(object) follows the same np.Type(...) pattern introduced in (2-3) for creating an array of a given type.

In [15]: x_int64_2 = np.int64(x_float64)

In [16]: x_int64_2.dtype

Out[16]: dtype('int64')

In [17]: x_int64_2

Out[17]: array([1, 2, 3, 4, 5], dtype=int64)
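Truncation means the fractional part is dropped toward zero, which matters for negative values and when you actually wanted rounding. The sketch below contrasts astype with np.rint and np.floor, and confirms that astype returns a copy; the input values are made up to include negatives.

```python
import numpy as np

x = np.array([-2.7, -0.5, 1.4, 2.6, 5.32])

truncated = x.astype(np.int64)            # toward zero:  [-2,  0, 1, 2, 5]
rounded = np.rint(x).astype(np.int64)     # nearest:      [-3,  0, 1, 3, 5]
floored = np.floor(x).astype(np.int64)    # toward -inf:  [-3, -1, 1, 2, 5]

# astype returns a new array by default, even when the dtype is unchanged
same_type = x.astype(np.float64)
print(same_type is x, np.shares_memory(x, same_type))

print(truncated, rounded, floored)
```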

(3-3) Converting floating point (float64) to string : x_float64.astype(np.string_)

Don't forget the '_' at the very end of 'np.string_'. It is not a typo; it is part of the name.

# (3-3) from float64 to string : astype(np.string_)
# fixed-length string type (1 byte per character)

In [18]: x_string = x_float64.astype(np.string_)

In [19]: x_string.dtype

Out[19]: dtype('S32')

In [20]: x_string

Out[20]:

array([b'1.4', b'2.6', b'3.0', b'4.9', b'5.32'],

dtype='|S32')

(3-4) Converting string to floating point (float64) : x_string.astype(np.float64)

# (3-4) from string to float64 : astype(np.float64)

In [21]: x_from_string_to_float64 = x_string.astype(np.float64)

In [22]: x_from_string_to_float64

Out[22]: array([ 1.4 , 2.6 , 3. , 4.9 , 5.32])
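A note for readers on recent versions: np.string_ was an alias of np.bytes_ and was removed in NumPy 2.0, so the transcript above only runs as written on NumPy 1.x. The sketch below uses np.bytes_, which is available in both, and also shows the text (Unicode) alternative via astype(str).

```python
import numpy as np

x = np.array([1.4, 2.6, 3.0, 4.9, 5.32])

as_bytes = x.astype(np.bytes_)   # fixed-length byte strings (dtype kind 'S')
as_text = x.astype(str)          # fixed-length Unicode strings (dtype kind 'U')

# Converting back to float64 recovers the numbers
back = as_bytes.astype(np.float64)

print(as_bytes.dtype.kind, as_text.dtype.kind)
print(back)
```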

(4) Comparing Python's int with NumPy's int64

(difference between Python's native int and NumPy's int64)

In [23]: x_py = 12345

In [24]: x_np = np.int64(12345)

Of the two listings below, the first shows the operators, methods, and attributes available on Python's native int, and the second shows those available on NumPy's int64. The operators are much the same, but NumPy has far more methods and attributes (8 for Python int vs. 68 for NumPy int64, roughly 8 times as many). This is one of the reasons NumPy is considered powerful.

In [25]: dir(x_py)

Out[25]:

# python's native int operators

['__abs__',

'__add__',

'__and__',

'__bool__',

'__ceil__',

'__class__',

'__delattr__',

'__dir__',

'__divmod__',

'__doc__',

'__eq__',

'__float__',

'__floor__',

'__floordiv__',

'__format__',

'__ge__',

'__getattribute__',

'__getnewargs__',

'__gt__',

'__hash__',

'__index__',

'__init__',

'__int__',

'__invert__',

'__le__',

'__lshift__',

'__lt__',

'__mod__',

'__mul__',

'__ne__',

'__neg__',

'__new__',

'__or__',

'__pos__',

'__pow__',

'__radd__',

'__rand__',

'__rdivmod__',

'__reduce__',

'__reduce_ex__',

'__repr__',

'__rfloordiv__',

'__rlshift__',

'__rmod__',

'__rmul__',

'__ror__',

'__round__',

'__rpow__',

'__rrshift__',

'__rshift__',

'__rsub__',

'__rtruediv__',

'__rxor__',

'__setattr__',

'__sizeof__',

'__str__',

'__sub__',

'__subclasshook__',

'__truediv__',

'__trunc__',

'__xor__',

# python's native int methods and attributes

'bit_length',

'conjugate',

'denominator',

'from_bytes',

'imag',

'numerator',

'real',

'to_bytes']

In [26]: dir(x_np)

Out[26]:

# numpy's int64 operators

['T',

'__abs__',

'__add__',

'__and__',

'__array__',

'__array_interface__',

'__array_priority__',

'__array_struct__',

'__array_wrap__',

'__bool__',

'__class__',

'__copy__',

'__deepcopy__',

'__delattr__',

'__dir__',

'__divmod__',

'__doc__',

'__eq__',

'__float__',

'__floordiv__',

'__format__',

'__ge__',

'__getattribute__',

'__getitem__',

'__gt__',

'__hash__',

'__index__',

'__init__',

'__int__',

'__invert__',

'__le__',

'__lshift__',

'__lt__',

'__mod__',

'__mul__',

'__ne__',

'__neg__',

'__new__',

'__or__',

'__pos__',

'__pow__',

'__radd__',

'__rand__',

'__rdivmod__',

'__reduce__',

'__reduce_ex__',

'__repr__',

'__rfloordiv__',

'__rlshift__',

'__rmod__',

'__rmul__',

'__ror__',

'__round__',

'__rpow__',

'__rrshift__',

'__rshift__',

'__rsub__',

'__rtruediv__',

'__rxor__',

'__setattr__',

'__setstate__',

'__sizeof__',

'__str__',

'__sub__',

'__subclasshook__',

'__truediv__',

'__xor__',

# numpy's int64 methods and attributes

'all',

'any',

'argmax',

'argmin',

'argsort',

'astype',

'base',

'byteswap',

'choose',

'clip',

'compress',

'conj',

'conjugate',

'copy',

'cumprod',

'cumsum',

'data',

'denominator',

'diagonal',

'dtype',

'dump',

'dumps',

'fill',

'flags',

'flat',

'flatten',

'getfield',

'imag',

'item',

'itemset',

'itemsize',

'max',

'mean',

'min',

'nbytes',

'ndim',

'newbyteorder',

'nonzero',

'numerator',

'prod',

'ptp',

'put',

'ravel',

'real',

'repeat',

'reshape',

'resize',

'round',

'searchsorted',

'setfield',

'setflags',

'shape',

'size',

'sort',

'squeeze',

'std',

'strides',

'sum',

'swapaxes',

'take',

'tobytes',

'tofile',

'tolist',

'tostring',

'trace',

'transpose',

'var',

'view']
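Rather than counting the entries by eye, the comparison can be done in code. This is a rough sketch; the helper public_names is hypothetical, and the exact counts depend on your Python and NumPy versions, so they will not necessarily match the 8 and 68 quoted above.

```python
import numpy as np

def public_names(obj):
    """Hypothetical helper: names from dir() that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]

x_py = 12345
x_np = np.int64(12345)

py_public = public_names(x_py)
np_public = public_names(x_np)

print(len(py_public), len(np_public))

# Methods and attributes that only the NumPy scalar has
only_numpy = sorted(set(np_public) - set(py_public))
print(only_numpy[:10])
```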

I hope this was helpful.


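[Python NumPy] Random Sampling and Random Number Generation (random sampling, random number generation)

Python Analysis and Programming / Python Data Preprocessing, 2017. 1. 21. 22:39

In this post we look at generating random numbers (random sampling), which you can use when time and cost rule out a full census and you have to survey a sample, when you split a dataset into training/validation/test sets for machine learning, or when you generate data at random from various probability distributions for simulation.

Python NumPy provides the numpy.random module, which can produce random samples very quickly and efficiently.

Let's import NumPy and draw a random sample of 5 values (size=5) from the normal distribution (np.random.normal). You can see that the values change every time a sample is drawn.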

In [1]: import numpy as np

In [2]: np.random.normal(size=5)

Out[2]: array([-0.02030555, 0.38279633, -1.02369692, 1.48083476, -0.44058273])

In [3]: np.random.normal(size=5) # array with different random numbers

Out[3]: array([ 1.11942454, -1.03486318, 1.69015608, -0.43601241, -1.52195043])

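First, let me introduce the seed and size parameters.

seed : sets the initial value for random number generation

If you want the same random numbers no matter who runs the code or when, instead of different values on every run (that is, reproducibility), set a seed number.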

# seed : setting the seed number for random number generation for reproducibility

In [4]: np.random.seed(seed=100)

In [5]: np.random.normal(size=5)

Out[5]: array([-1.74976547, 0.3426804 , 1.1530358 , -0.25243604, 0.98132079])

# exactly the same with the above random numbers

In [6]: np.random.seed(seed=100)

In [7]: np.random.normal(size=5) # same as the result above

Out[7]: array([-1.74976547, 0.3426804 , 1.1530358 , -0.25243604, 0.98132079])
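np.random.seed sets one global state shared by every np.random.* call. NumPy 1.17 and later also offer np.random.default_rng, which returns an independent generator object with its own seed. The sketch below shows the same reproducibility idea with that interface; note that it uses a different algorithm, so the numbers will not match the np.random.seed(100) output above.

```python
import numpy as np

# Two generators created with the same seed produce the same stream
rng1 = np.random.default_rng(seed=100)
rng2 = np.random.default_rng(seed=100)

s1 = rng1.normal(size=5)
s2 = rng2.normal(size=5)
print(np.array_equal(s1, s2))

# Drawing again from the same generator continues the stream with new values
s3 = rng1.normal(size=5)
print(np.array_equal(s1, s3))
```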

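size : sets the number of samples to generate (draw) and the array shape

Being able to generate random samples directly as a multidimensional array is another advantage of the NumPy random module.

# size : int or tuple of ints for setting the shape of random number array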

In [8]: np.random.normal(size=2)

Out[8]: array([ 0.51421884, 0.22117967])

In [9]: np.random.normal(size=(2, 3))

Out[9]:

array([[-1.07004333, -0.18949583, 0.25500144],

[-0.45802699, 0.43516349, -0.58359505]])

In [10]: np.random.normal(size=(2, 3, 4))

Out[10]:

array([[[ 0.81684707, 0.67272081, -0.10441114, -0.53128038],

[ 1.02973269, -0.43813562, -1.11831825, 1.61898166],

[ 1.54160517, -0.25187914, -0.84243574, 0.18451869]],

[[ 0.9370822 , 0.73100034, 1.36155613, -0.32623806],

[ 0.05567601, 0.22239961, -1.443217 , -0.75635231],

[ 0.81645401, 0.75044476, -0.45594693, 1.18962227]]])

Now let's generate random numbers from various probability distributions. We start with the discrete probability distributions, which yield integers: (1-1) the binomial distribution, (1-2) the hypergeometric distribution, and (1-3) the Poisson distribution.

Explaining each distribution here would make the post too long, so I link to posts you can refer to instead.

- Binomial Distribution : http://rfriend.tistory.com/99

- Hypergeometric distribution : http://rfriend.tistory.com/100

- Poisson Distribution : http://rfriend.tistory.com/101

(1-1) Random sampling from Binomial Distribution : np.random.binomial(n, p, size)

I tossed a coin 20 times (size=20), one toss per trial (n=1), with heads and tails each coming up with probability 50% (p=0.5).

# (1) Discrete Probability Distribution
# (1-1) Binomial Distribution : np.random.binomial(n, p, size)
# : sampling with replacement
# : n an integer >= 0 and p is in the interval [0,1]

In [11]: np.random.binomial(n=1, p=0.5, size=20)

Out[11]: array([0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1])

In [12]: sum(np.random.binomial(n=1, p=0.5, size=100) == 1)/100

Out[12]: 0.46999999999999997
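With n=1 each draw is a single coin toss. With a larger n, each draw is the number of successes in n tosses. The sketch below, using the default_rng interface (an assumption; the post uses the older np.random functions), draws many such counts and compares the sample mean and variance with the theoretical values n*p and n*p*(1-p).

```python
import numpy as np

rng = np.random.default_rng(100)

n, p = 10, 0.5
heads = rng.binomial(n=n, p=p, size=10_000)   # heads per 10 tosses, 10,000 times

print(heads.min(), heads.max())   # always between 0 and n
print(heads.mean())               # close to n * p = 5
print(heads.var())                # close to n * p * (1 - p) = 2.5

# Proportion of single tosses that came up heads; (x == 1).mean() is a
# shorter way to write sum(x == 1) / len(x)
tosses = rng.binomial(n=1, p=0.5, size=100)
print((tosses == 1).mean())
```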

(1-2) Random sampling from Hypergeometric distribution : np.random.hypergeometric(ngood, nbad, nsample, size)

From a population with 5 good and 20 bad items, we draw 5 samples at random without replacement, simulate this 100 times, build a frequency table, and show it as a bar chart.

# (1-2) Hypergeometric distribution
# : sampling without replacement
# : np.random.hypergeometric(ngood, nbad, nsample, size=None)

In [13]: np.random.seed(seed=100)

In [14]: rand_hyp = np.random.hypergeometric(ngood=5, nbad=20, nsample=5, size=100)

In [15]: rand_hyp

Out[15]:

array([1, 1, 1, 1, 1, 1, 0, 3, 0, 1, 2, 0, 1, 1, 2, 0, 1, 1, 0, 0, 0, 3, 2,
        0, 0, 0, 1, 3, 1, 1, 0, 0, 1, 0, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2,
        1, 0, 1, 1, 2, 1, 2, 1, 2, 0, 1, 2, 1, 1, 0, 1, 0, 1, 0, 1, 1, 2, 0,
        2, 0, 1, 2, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 2, 1, 1, 0, 0, 2, 1, 1,
        1, 1, 1, 1, 1, 3, 1, 0])

# result table of 100 simulation

In [16]: unique, counts = np.unique(rand_hyp, return_counts=True)

In [17]: np.asarray((unique, counts)).T

Out[17]:

array([[ 0, 27],
       [ 1, 53],
       [ 2, 16],
       [ 3,  4]], dtype=int64)

# bar plot

In [18]: import matplotlib.pyplot as plt

In [19]: plt.bar(unique, counts, width=0.5, color="blue", align='center')

Out[19]: <Container object of 4 artists>
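The simulated frequencies can be compared against the exact hypergeometric probabilities, P(X = k) = C(ngood, k) * C(nbad, nsample - k) / C(ngood + nbad, nsample). A sketch, assuming Python 3.8+ for math.comb and using default_rng with a larger sample than the post so that the frequencies settle down; hypergeom_pmf is a hypothetical helper name.

```python
from math import comb
import numpy as np

ngood, nbad, nsample = 5, 20, 5

def hypergeom_pmf(k):
    """Exact probability of drawing k good items (hypothetical helper)."""
    return comb(ngood, k) * comb(nbad, nsample - k) / comb(ngood + nbad, nsample)

theoretical = np.array([hypergeom_pmf(k) for k in range(nsample + 1)])

rng = np.random.default_rng(100)
draws = rng.hypergeometric(ngood, nbad, nsample, size=100_000)
empirical = np.bincount(draws, minlength=nsample + 1) / draws.size

for k in range(nsample + 1):
    print(k, round(theoretical[k], 4), round(empirical[k], 4))
```

For k = 0 and k = 1 the exact probabilities are about 0.29 and 0.46, in line with the 27 and 53 out of 100 seen in the simulation above.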

(1-3) Random sampling from Poisson distribution : np.random.poisson(lam, size)

(random sampling from Poisson distribution)

Let's generate 100 random numbers from a Poisson distribution in which λ (lambda), the average number of events occurring at random in a fixed unit of time or space, is 20. Then we count the frequencies and draw the distribution as a bar chart.

# (1-3) Poisson Distribution
# np.random.poisson(lam=1.0, size=None)
# Poisson distribution is the limit of the binomial distribution for large N

In [20]: np.random.seed(seed=100)

In [21]: rand_pois = np.random.poisson(lam=20, size=100)

In [22]: rand_pois

Out[22]:

array([21, 19, 22, 14, 26, 15, 25, 25, 19, 25, 15, 24, 21, 13, 26, 23, 21,
       16, 24, 17, 18, 18, 15, 18, 22, 28, 21, 18, 17, 31, 23, 13, 20, 19,
       24, 17, 20, 13, 19, 16, 16, 21, 16, 21, 19, 20, 20, 19, 19, 20, 13,
       29,  9, 13, 20, 29, 15, 15, 21, 20, 21, 18, 16, 20, 23, 18, 22, 14,
       19, 20, 18, 17, 20, 24, 20, 15, 19, 19, 25, 17, 19, 27, 20, 17, 12,
       22, 16, 23, 17, 11, 15, 19, 16, 21, 21, 25, 26, 23, 15, 25])

In [23]: unique, counts = np.unique(rand_pois, return_counts=True)

In [24]: np.asarray((unique, counts)).T

Out[24]:

array([[ 9,  1],
       [11,  1],
       [12,  1],
       [13,  5],
       [14,  2],
       [15,  8],
       [16,  7],
       [17,  7],
       [18,  7],
       [19, 12],
       [20, 12],
       [21, 10],
       [22,  4],
       [23,  5],
       [24,  4],
       [25,  6],
       [26,  3],
       [27,  1],
       [28,  1],
       [29,  2],
       [31,  1]], dtype=int64)

In [25]: plt.bar(unique, counts, width=0.5, color="red", align='center')

Out[25]: <Container object of 21 artists>
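A property worth checking on Poisson samples is that the mean and the variance are both equal to λ. The sketch below uses default_rng and a larger sample than the post (both are my assumptions) so that the sample statistics land close to 20.

```python
import numpy as np

rng = np.random.default_rng(100)
lam = 20
sample = rng.poisson(lam=lam, size=100_000)

print(sample.mean())   # close to lam
print(sample.var())    # also close to lam

# Frequency table, as in the post: one row per observed value
unique, counts = np.unique(sample, return_counts=True)
print(np.asarray((unique, counts)).T[:5])
```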

Next come the continuous probability distributions: how to generate random numbers from (2-1) the normal distribution, (2-2) the t-distribution, (2-3) the uniform distribution, (2-4) the F-distribution, and (2-5) the chi-square distribution.

For the theory behind each distribution, see the linked posts below.

- Normal Distribution : http://rfriend.tistory.com/102

- Student's t-distribution : http://rfriend.tistory.com/110

- Uniform Distribution : http://rfriend.tistory.com/106

- F-distribution : http://rfriend.tistory.com/111

- Chisq-distribution : http://rfriend.tistory.com/112

(2-1) Random sampling from Normal Distribution : np.random.normal(loc, scale, size)

(random sampling from Normal Distribution)

Let's generate 100 random numbers from a normal distribution with mean '0' and standard deviation '3', then draw histograms to look at the distribution both as the frequency per bin and as a normalized density.

# (2) continuous probability distribution
# (2-1) generating random numbers from a normal distribution
# Draw random samples from a normal (Gaussian) distribution
# np.random.normal(loc=0.0, scale=1.0, size=None)
# loc (mu) : Mean (“centre”) of the distribution
# scale (sigma) : Standard deviation (spread or “width”) of the distribution
# size : Output shape

In [26]: np.random.seed(100)

In [27]: mu, sigma = 0.0, 3.0

In [28]: rand_norm = np.random.normal(mu, sigma, size=100)

In [29]: rand_norm

Out[29]:

array([-5.24929642, 1.02804121, 3.45910741, -0.75730811, 2.94396236,
        1.54265652, 0.66353901, -3.21012999, -0.56848749, 0.76500433,
       -1.37408096, 1.30549046, -1.75078515, 2.45054122, 2.01816242,
       -0.31323343, -1.59384113, 3.08919806, -1.31440687, -3.35495474,
        4.85694498, 4.62481552, -0.75563742, -2.52730721, 0.55355607,
        2.8112466 , 2.19300103, 4.08466838, -0.97871418, 0.16702804,
        0.66719883, -4.32965099, -2.26905692, 2.44936203, 2.25133428,
       -1.36784078, 3.5688668 , -5.07185048, -4.06919715, -3.69730354,
       -1.63331749, -2.00451521, 0.02194369, -1.83881621, 3.89924422,
       -5.19928687, -2.9499303 , 1.07252326, -4.84073551, 4.4121416 ,
       -3.56405279, -1.64923858, -2.82013848, -2.48379709, 0.3265904 ,
        1.52342877, -2.58668204, 3.74840923, -0.23883374, -2.66919444,
       -2.64539517, 0.05591685, 0.71353387, 0.04064565, -4.9065882 ,
       -3.13262963, 1.83911665, 2.20861564, 3.08076432, -4.29657183,
       -5.5235649 , 1.09827968, -0.99533141, -2.06765393, 6.10382268,
       -1.65214324, 2.25135999, -3.92097702, 1.74172001, -3.31356928,
        2.07036441, 2.0606702 , -4.70006259, 2.71492236, 2.3364672 ,
        1.28469861, 0.32661597, 0.0848509 , -1.73647747, -3.5983536 ,
       -5.11785602, 1.10749187, 5.62972028, -1.13071005, 5.49580825,
        0.0090523 , -0.2280704 , 0.01187278, -0.55504233, -7.46145461])

In [30]: import matplotlib.pyplot as plt

# histogram with frequency

In [31]: count, bins, ignored = plt.hist(rand_norm, density=False)

# histogram with normalized density (density= replaces the normed= argument, which was removed in Matplotlib 3.1)
In [32]: count, bins, ignored = plt.hist(rand_norm, density=True)
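What the normalized histogram means can be checked numerically without plotting. np.histogram computes the same bin counts that plt.hist draws; with density=True the bar heights are scaled so that the total area (height times bin width) is 1. A sketch using default_rng, so the sample differs from the one in the post:

```python
import numpy as np

rng = np.random.default_rng(100)
sample = rng.normal(loc=0.0, scale=3.0, size=100)

# Frequencies per bin (what a plain histogram shows)
counts, edges = np.histogram(sample, bins=10)

# Density per bin: counts / (n * bin width)
density, _ = np.histogram(sample, bins=10, density=True)

widths = np.diff(edges)
print(counts.sum())                # number of samples
print((density * widths).sum())    # total area under the histogram
```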

(2-2) Random sampling from t-distribution : np.random.standard_t(df, size)

(Random sampling from t-distribution)

Let's generate 100 random numbers from a t-distribution with '3' degrees of freedom and draw a histogram.

# (2-2) generating random numbers from Student's t-distribution
# Draw samples from a standard Student’s t distribution with df degrees of freedom
# np.random.standard_t(df, size=None)
# df : Degrees of freedom
# size : Output shape

In [33]: np.random.seed(100)

In [34]: rand_t = np.random.standard_t(df=3, size=100)

In [35]: rand_t

Out[35]:

array([-1.70633623, 0.61010003, 0.45753218, -0.85709656, -0.42990712,
       -0.7437467 , 0.8444005 , -0.4040428 , 2.13905276, -0.10844638,
        0.67238716, 1.88720362, -2.57340231, -0.69724955, -3.40107659,
       -0.57745433, -0.36487447, 3.95862541, 2.34665412, -0.94310449,
        0.81852816, -0.48391289, 0.01380029, -0.43003718, -2.25784604,
       -0.18216847, -1.21433582, 0.46347964, 0.50024665, -1.1595865 ,
        0.02358778, -1.18879826, -0.38767689, 2.24289791, -2.80798472,
       -2.838893 , -0.39222432, -1.61499121, -1.78498184, 0.44618923,
       -1.5181203 , 5.44389927, 4.17743903, -0.49617121, -0.02996529,
        0.89595015, 1.14860485, -3.16541308, 0.14279246, 0.83121743,
       -0.32403947, 0.59297222, -0.39750861, 0.57634934, 0.81587478,
       -1.29367024, -0.28580516, -0.48422765, -0.83697192, 0.50702557,
       -1.98915687, 2.92965716, -1.19522074, 0.65511251, 2.12055605,
       -0.03640814, -0.41931018, 3.31199804, -0.61725596, 0.79681204,
        1.86805014, -0.54345259, 3.11909936, 0.86410458, 2.66353682,
        0.23735454, -0.76306875, 0.24471792, -0.13515045, 0.26402784,
        4.68946895, 0.70573709, -0.17783758, 1.85205955, -0.18352788,
       -0.65713104, -0.73674278, 2.16549569, 1.22326388, -0.5112858 ,
       -1.54451989, -1.73428432, 0.46947115, 1.66594804, 0.51687137,
        1.51361314, -2.22193709, 0.89557421, 0.56222653, -0.55564416])

# histogram

In [36]: import matplotlib.pyplot as plt

In [37]: count, bins, ignored = plt.hist(rand_t, bins=20, density=True)
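The t-distribution with few degrees of freedom has heavier tails than the standard normal, which is why values such as 5.44 show up in a sample of only 100. As a rough sketch (default_rng and the cutoff of 3 are my choices), the share of draws beyond ±3 can be compared directly:

```python
import numpy as np

rng = np.random.default_rng(100)
n = 200_000

t_sample = rng.standard_t(df=3, size=n)
z_sample = rng.standard_normal(size=n)

# Fraction of draws farther than 3 from zero
t_tail = np.mean(np.abs(t_sample) > 3)
z_tail = np.mean(np.abs(z_sample) > 3)

print(t_tail)   # roughly 0.06 for df=3
print(z_tail)   # roughly 0.003 for the standard normal
```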

(2-3) Random sampling from Uniform distribution : np.random.uniform(low, high, size)

(random sampling from Uniform distribution)

Let's generate 100 random numbers from a uniform distribution on the interval with minimum '0' and maximum '10', then draw a histogram to check the distribution.

# (2-3) generating random numbers from a Uniform Distribution
# Draw samples from a uniform distribution
# np.random.uniform(low=0.0, high=1.0, size=None)
# low : Lower boundary of the output interval
# high : Upper boundary of the output interval
# [low, high) : includes low, excludes high

In [38]: np.random.seed(100)

In [39]: rand_unif = np.random.uniform(low=0.0, high=10.0, size=100)

In [40]: rand_unif

Out[40]:

array([ 5.43404942, 2.78369385, 4.24517591, 8.44776132, 0.04718856,
        1.21569121, 6.70749085, 8.25852755, 1.3670659 , 5.75093329,
        8.91321954, 2.09202122, 1.8532822 , 1.0837689 , 2.19697493,
        9.78623785, 8.11683149, 1.71941013, 8.16224749, 2.74073747,
        4.31704184, 9.4002982 , 8.17649379, 3.3611195 , 1.75410454,
        3.72832046, 0.05688507, 2.52426353, 7.95662508, 0.15254971,
        5.98843377, 6.03804539, 1.05147685, 3.81943445, 0.36476057,
        8.90411563, 9.80920857, 0.59941989, 8.90545945, 5.76901499,
        7.42479689, 6.30183936, 5.81842192, 0.20439132, 2.10026578,
        5.44684878, 7.69115171, 2.50695229, 2.8589569 , 8.52395088,
        9.75006494, 8.84853293, 3.59507844, 5.98858946, 3.54795612,
        3.40190215, 1.7808099 , 2.37694209, 0.44862282, 5.0543143 ,
        3.76252454, 5.92805401, 6.29941876, 1.42600314, 9.33841299,
        9.46379881, 6.02296658, 3.8776628 , 3.63188004, 2.04345277,
        2.76765061, 2.46535881, 1.73608002, 9.66609694, 9.570126 ,
        5.97973684, 7.31300753, 3.40385223, 0.92055603, 4.63498019,
        5.08698893, 0.88460173, 5.28035223, 9.92158037, 3.95035932,
        3.35596442, 8.05450537, 7.54348995, 3.13066442, 6.34036683,
        5.40404575, 2.96793751, 1.10787901, 3.12640298, 4.5697913 ,
        6.5894007 , 2.54257518, 6.41101259, 2.00123607, 6.57624806])

# histogram

In [41]: import matplotlib.pyplot as plt

In [42]: count, bins, ignored = plt.hist(rand_unif, bins=10, density=True)
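The half-open interval [low, high) and the expected mean and variance of a uniform distribution can be checked on a larger sample. A minimal sketch, again assuming the default_rng interface rather than the np.random functions used in the post:

```python
import numpy as np

rng = np.random.default_rng(100)
low, high = 0.0, 10.0
sample = rng.uniform(low=low, high=high, size=100_000)

print(sample.min() >= low, sample.max() < high)   # low included, high excluded
print(sample.mean())   # close to (low + high) / 2 = 5
print(sample.var())    # close to (high - low) ** 2 / 12, about 8.33
```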

Random integers from a discrete uniform distribution : np.random.randint(low, high, size)

(Random INTEGERS from discrete uniform distribution)

To generate integer random numbers from a discrete uniform distribution, use np.random.randint. The parameters are the same as for np.random.uniform(), but the value given for high is excluded, so if you want random integers from 0 to 10 (that is, including 10), add 1, as in high=10+1.

# np.random.randint : Discrete uniform distribution, yielding integers

In [43]: np.random.seed(100)

In [44]: rand_int = np.random.randint(low=0, high=10+1, size=100)

In [45]: rand_int

Out[45]:

array([ 8, 8, 3, 7, 7, 0, 10, 4, 2, 5, 2, 2, 2, 1, 0, 8, 4,
       10, 0, 9, 6, 2, 4, 1, 5, 3, 4, 4, 3, 7, 1, 1, 7, 7,
        0, 2, 9, 9, 3, 2, 5, 8, 1, 0, 7, 6, 2, 0, 8, 2, 5,
       10, 1, 8, 10, 1, 5, 4, 2, 8, 3, 5, 0, 9, 10, 3, 6, 3,
        4, 10, 7, 6, 3, 9, 0, 4, 4, 5, 7, 6, 6, 2, 10, 4, 2,
        7, 1, 10, 6, 10, 6, 0, 10, 7, 2, 3, 5, 4, 2, 4])

# histogram

In [46]: import matplotlib.pyplot as plt

In [47]: count, bins, ignored = plt.hist(rand_int, bins=10, density=True)
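The exclusive upper bound is easy to get wrong, so here is a small sketch of two ways to include 10. It assumes the newer Generator API (np.random.default_rng and its integers method with the endpoint flag) instead of the legacy np.random.randint used above; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(100)

# Same convention as np.random.randint: high is exclusive, so pass 10 + 1.
plus_one = rng.integers(low=0, high=10 + 1, size=100)

# Generator.integers can also include the upper bound explicitly.
inclusive = rng.integers(low=0, high=10, size=100, endpoint=True)

# Count how often each integer 0..10 occurs (11 slots, one per value).
counts = np.bincount(inclusive, minlength=11)
print(counts, counts.sum())
```

Because there are 11 possible values (0 through 10), a histogram with bins=10 puts 9 and 10 into the same bar; counting with np.bincount keeps one count per integer.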

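(2-4) Random sampling from the F-distribution : np.random.f(dfnum, dfden, size)

(Random sampling from F-distribution)

Let's generate 100 random numbers from an F-distribution with numerator degrees of freedom 5 and denominator degrees of freedom 10, and plot a histogram to check the distribution.

# (2-4) Generating random numbers from the F-distribution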
# Draw samples from an F-distribution (Fisher distribution)
# numpy.random.f(dfnum, dfden, size=None)
# dfnum : degrees of freedom in numerator
# dfden : degrees of freedom in denominator

In [48]: np.random.seed(100)

In [49]: rand_f = np.random.f(dfnum=5, dfden=10, size=100)

In [50]: rand_f

Out[50]:

array([ 0.17509245, 1.34830314, 0.7250835 , 0.55013536, 1.49183341,
        1.19802261, 1.24949706, 0.70015548, 0.71890936, 0.37020715,
        4.70371284, 0.86726338, 5.12146941, 0.12848202, 0.68237285,
        0.79663258, 1.36935299, 1.08005188, 0.99311831, 0.15607878,
        3.7778542 , 2.35609305, 0.16850985, 0.98599364, 1.12567067,
        3.21579679, 0.87982087, 0.38319493, 0.96834789, 1.00428004,
        1.65589171, 1.2581278 , 1.71881244, 0.11251552, 1.65949951,
        1.15809569, 1.33210756, 0.37989215, 0.252446 , 1.22409406,
        1.86571485, 0.42345727, 3.52740557, 1.32989807, 2.0095314 ,
        1.20016474, 3.5067706 , 0.67232354, 2.79268109, 0.38115844,
        1.3978449 , 0.7089553 , 2.12685211, 0.73462708, 2.03686026,
        0.50287078, 0.31183315, 1.66994305, 5.36906534, 1.55708073,
        2.66826698, 1.31701804, 0.66126086, 0.19123589, 0.58223398,
        0.41897952, 2.17842598, 0.98481411, 0.46953552, 0.99266818,
        0.4463218 , 0.43809118, 0.37791494, 2.46417893, 0.91230902,
        0.50247167, 1.0960922 , 0.61328846, 2.07107491, 0.65524443,
        4.00311763, 1.61430287, 0.16159395, 2.42851301, 1.38124899,
        0.33750889, 1.93776135, 1.55612023, 0.59284748, 0.56785228,
        1.09259657, 1.22611626, 0.0744978 , 0.10373193, 1.95616674,
        2.29130443, 0.62968361, 0.67477008, 0.60981642, 0.58408102])

# histogram

In [51]: import matplotlib.pyplot as plt

In [52]: count, bins, ignored = plt.hist(rand_f, bins=20, density=True)
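To make the roles of dfnum and dfden more concrete, the sketch below builds F-distributed values by hand as the ratio of two chi-square variables, each divided by its degrees of freedom, and compares the sample mean with the theoretical mean dfden / (dfden - 2). It assumes the Generator API (np.random.default_rng) and a much larger sample size than the 100 used above, purely so the averages settle down; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(100)
dfnum, dfden, n = 5, 10, 200_000

# Direct draws from the F-distribution.
f_direct = rng.f(dfnum, dfden, size=n)

# The same distribution built by hand: a ratio of two chi-square
# variables, each scaled by its own degrees of freedom.
chi_num = rng.chisquare(dfnum, size=n) / dfnum
chi_den = rng.chisquare(dfden, size=n) / dfden
f_ratio = chi_num / chi_den

# Theoretical mean of F(dfnum, dfden), valid for dfden > 2.
expected_mean = dfden / (dfden - 2)
print(f_direct.mean(), f_ratio.mean(), expected_mean)
```

Both sample means should land close to 1.25, though not exactly on it, since these are random draws.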

(2-5) Random sampling from the chi-square distribution : np.random.chisquare(df, size)

(Random sampling from Chisq-distribution)

Let's generate 100 random numbers from a chi-square distribution with 2 degrees of freedom, and plot a histogram to check the distribution.

# (2-5) Generating random numbers from the chi-square distribution
# Draw samples from a chi-square distribution
# np.random.chisquare(df, size=None)
# df : Number of degrees of freedom

In [53]: np.random.seed(100)

In [54]: rand_chisq = np.random.chisquare(df=2, size=100)

In [55]: rand_chisq

Out[55]:

array([ 1.56791674e+00,   6.52483769e-01,   1.10509323e+00,
         3.72577379e+00,   9.46005029e-03,   2.59236110e-01,
         2.22187032e+00,   3.49570821e+00,   2.94001314e-01,
         1.71177147e+00,   4.43873095e+00,   4.69425742e-01,
         4.09939940e-01,   2.29423517e-01,   4.96147209e-01,
         7.69095282e+00,   3.33925872e+00,   3.77341773e-01,
         3.38808346e+00,   6.40613699e-01,   1.13022638e+00,
         5.62781567e+00,   3.40364791e+00,   8.19283487e-01,
         3.85739072e-01,   9.33081811e-01,   1.14094971e-02,
         5.81844910e-01,   3.17596456e+00,   3.07450508e-02,
         1.82680669e+00,   1.85169520e+00,   2.22193172e-01,
         9.62350625e-01,   7.43158820e-02,   4.42204683e+00,
         7.91831906e+00,   1.23627383e-01,   4.42450081e+00,
         1.72030053e+00,   2.71331337e+00,   1.98949905e+00,
         1.74379278e+00,   4.13018033e-02,   4.71511953e-01,
         1.57353105e+00,   2.93167254e+00,   5.77218949e-01,
         6.73452471e-01,   3.82643217e+00,   7.37827846e+00,
         4.32309651e+00,   8.91036809e-01,   1.82688432e+00,
         8.76376262e-01,   8.31607381e-01,   3.92226832e-01,
         5.42815006e-01,   9.17994841e-02,   1.40813894e+00,
         9.44019133e-01,   1.79692816e+00,   1.98819039e+00,
         3.07702183e-01,   5.43139774e+00,   5.85166185e+00,
         1.84409785e+00,   9.81282349e-01,   9.02561614e-01,
         4.57179904e-01,   6.48042320e-01,   5.66147763e-01,
         3.81372088e-01,   6.79897935e+00,   6.29369648e+00,
         1.82247546e+00,   2.62832513e+00,   8.32198571e-01,
         1.93144279e-01,   1.24537005e+00,   1.42139617e+00,
         1.85239984e-01,   1.50170184e+00,   9.69653206e+00,
         1.00517243e+00,   8.17731091e-01,   3.27413768e+00,
         2.80768685e+00,   7.51035408e-01,   2.01044435e+00,
         1.55481738e+00,   7.04210092e-01,   2.34838981e-01,
         7.49795080e-01,   1.22121505e+00,   2.15139414e+00,
         5.86749873e-01,   2.04942998e+00,   4.46596145e-01,
         2.14369617e+00])

# histogram

In [56]: import matplotlib.pyplot as plt

In [57]: count, bins, ignored = plt.hist(rand_chisq, bins=20, density=True)
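As a sanity check on what df means, a chi-square variable with df degrees of freedom can be built as the sum of df squared standard normal variables; its mean is df and its variance is 2 * df. The sketch below compares that hand-built version with direct draws. It assumes the Generator API (np.random.default_rng) and a large sample size chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(100)
df, n = 2, 200_000

# Direct draws from the chi-square distribution.
chisq_direct = rng.chisquare(df, size=n)

# Hand-built version: sum of df squared standard normal variables per row.
z = rng.standard_normal(size=(n, df))
chisq_manual = np.sum(z ** 2, axis=1)

print(chisq_direct.mean(), chisq_manual.mean())  # both should be near df
print(chisq_direct.var(), chisq_manual.var())    # both should be near 2 * df
```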

I hope this was helpful.

For random sampling from a pandas DataFrame, see https://rfriend.tistory.com/602.

Attribution-NonCommercial-NoDerivs

Posted by Rfriend

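[Python NumPy] Creating multi-dimensional arrays (ndarray)

Python Analysis and Programming/Python Data Preprocessing 2017. 1. 14. 22:57

So far we have covered a variety of methods and functions in Python pandas.

Starting with this post, I plan to write a series on Python NumPy. NumPy is the module that was always imported alongside pandas in the earlier posts; [ import numpy as np ] was (almost) always the first line.

Just as licorice is the ingredient that never gets left out when an herbal pharmacy prepares a remedy, NumPy is the module that shows up in nearly every data analysis done with Python. (You could also think of NumPy as the quick-quick-slow basic step of Latin dance and pandas as the patterns you dance with a partner. If the basic step is shaky, you cannot dance properly!)

With NumPy (Numerical Python) you can create the ndarray, an n-dimensional array that is very fast, memory-efficient, and supports vectorized arithmetic operations.

NumPy can also apply standard mathematical functions to an entire array at high speed without explicit loops. I once preprocessed a DataFrame with 2.5 million rows using a loop, and it had not finished after 4 hours; when I switched to NumPy functions, it finished in just a few minutes, which surprised me. (Put differently, running loops in Python is very bad for speed and performance. Keep performance in mind when looping over large data, and always check whether a NumPy function can replace the loop.)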
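To illustrate the difference, here is a minimal sketch of the same calculation written once as an element-by-element Python loop and once as a single vectorized expression. The function names and the calculation itself (x * 2 + 1) are made up for illustration and are not the preprocessing job mentioned above.

```python
import numpy as np

values = np.arange(1000, dtype=np.float64)

def scale_with_loop(x):
    """Element-by-element Python loop (slow for large arrays)."""
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = x[i] * 2.0 + 1.0
    return out

def scale_vectorized(x):
    """The same arithmetic applied to the whole array at once."""
    return x * 2.0 + 1.0

looped = scale_with_loop(values)
vectorized = scale_vectorized(values)

# Both approaches give identical results; only the speed differs.
print(np.array_equal(looped, vectorized))
```

Timing the two functions on a large array (for example with %timeit in IPython) is the easiest way to see how big the gap is on your own machine.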

NumPy is also used for linear algebra, random number generation, and more.

This post covers how to create ndarrays of various forms.

All of the elements of an ndarray must be of the same data type.

[ Creating NumPy's ndarray ]

* array image source: https://www.pinterest.com/pin/342062534176033515/

When importing the NumPy module, 'np' is used as the alias.

# importing numpy module

In [1]: import numpy as np

Let's create an ndarray with NumPy's array().

# making ndarrays : np.array() function

In [2]: arr1 = np.array([1, 2, 3, 4, 5])

In [3]: arr1

Out[3]: array([1, 2, 3, 4, 5])

You can also create a list object first and convert it to an array with np.array().

In [4]: data_list = [6, 7, 8, 9, 10] # a list

In [5]: arr2 = np.array(data_list) # converting a list into an array

In [6]: arr2

Out[6]: array([ 6, 7, 8, 9, 10])

A list that contains several lists of equal length can also be converted to an array with np.array().

# a list of equal-length lists

In [7]: data_lists = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

# converting a list of equal-length lists into a multi-dimensional array

In [8]: arr12 = np.array(data_lists)

In [9]: arr12

Out[9]:

array([[ 1, 2, 3, 4, 5],
       [ 6, 7, 8, 9, 10]])

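Use the array.shape attribute to check the size of each dimension of a multi-dimensional array. Indexing into shape returns the size of a single axis, here the number of rows and the number of columns (this is used fairly often).

The array.dtype attribute shows the data type of the elements.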

# a tuple indicating the size of each dimension

In [10]: arr12.shape # (2, 5)

Out[10]: (2, 5)

# indexing ndarray.shape[0], ndarray.shape[1]

In [11]: arr12.shape[0] # 2 : size of axis 0 (number of rows)

Out[11]: 2

In [12]: arr12.shape[1] # 5 : size of axis 1 (number of columns)

Out[12]: 5

# an object describing the data type of the array

In [13]: arr12.dtype # dtype('int32') here; the default integer dtype is platform-dependent (often int64)

Out[13]: dtype('int32')
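The attributes above, together with the rule that all elements share one data type, can be sketched as follows. The variable names are illustrative; the point is that NumPy picks a common dtype when the input mixes integers and floats, and that a dtype can be requested explicitly.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

# shape is a tuple, so it can be unpacked directly.
n_rows, n_cols = arr.shape
print(arr.ndim, n_rows, n_cols, arr.size)

# Mixing integers and a float upcasts every element to float64.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)

# The dtype can also be set explicitly at creation time.
as_float32 = np.array([1, 2, 3], dtype=np.float32)
print(as_float32.dtype)
```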

np.asarray() can also be used to create an array.

# np.asarray : Convert the input to an array

In [14]: a = [1, 2, 3, 4, 5] # a list

In [15]: type(a)

Out[15]: list

In [16]: a = np.asarray(a)

In [17]: type(a)

Out[17]: numpy.ndarray

Unlike np.array(), however, np.asarray() does not make a copy if the input is already an ndarray.

# Existing arrays are not copied

In [18]: b = np.array([1, 2])

In [19]: np.asarray(b) is b

Out[19]: True

If a dtype is specified, np.asarray() copies the array only when the requested data type differs from the array's existing one.

# If dtype is set, array is copied only if dtype does not match

In [20]: c = np.array([1, 2], dtype=np.float32)

In [21]: np.asarray(c, dtype=np.float32) is c # not copied

Out[21]: True

In [22]: np.asarray(c, dtype=np.float64) is c # copied

Out[22]: False
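Why the copy matters in practice: if no copy is made, changing the original array also changes the result. A minimal sketch, with illustrative variable names:

```python
import numpy as np

original = np.array([1, 2, 3])

copied = np.array(original)     # np.array() copies by default
aliased = np.asarray(original)  # no copy: the very same object

# Modify the original after the conversion.
original[0] = 99

print(copied[0])   # still the old value
print(aliased[0])  # follows the change

# np.shares_memory reports whether two arrays overlap in memory.
print(np.shares_memory(original, copied), np.shares_memory(original, aliased))
```

So np.asarray() is the cheaper choice when you only need to read the data, while np.array() is safer when the input must not be affected by later changes.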

To create an array whose elements are floats, use np.asfarray(). (np.asfarray() was removed in NumPy 2.0; np.asarray() with dtype=float gives the same result.)

# np.asfarray: Convert input to a floating point ndarray

In [23]: d = [6, 7, 8, 9, 10]

In [24]: np.asfarray(d)

Out[24]: array([ 6., 7., 8., 9., 10.])

np.asarray_chkfinite() raises a ValueError if the input data being converted to an array contains a missing value (NaN) or an infinite value. Let's check this by putting in a missing value with np.nan and an infinite value with np.inf.

# asarray_chkfinite : Convert a list into an array.
# If all elements are finite asarray_chkfinite is identical to asarray

In [25]: e = [11, 12, 13, 14, 15]

In [26]: np.asarray_chkfinite(e, dtype=float)

Out[26]: array([ 11., 12., 13., 14., 15.])

In [27]: e_2 = [11, 12, 13, np.nan, 15] # with NaN

In [28]: np.asarray_chkfinite(e_2, dtype=float)

ValueError: array must not contain infs or NaNs

In [29]: e_3 = [11, 12, 13, 14, np.inf] # with infinite

In [30]: np.asarray_chkfinite(e_3, dtype=float)

ValueError: array must not contain infs or NaNs
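In a script you would normally catch that ValueError rather than let it stop the program. The sketch below wraps the call in a small helper; to_finite_array is a hypothetical name introduced here for illustration, not a NumPy function.

```python
import numpy as np

def to_finite_array(values):
    """Hypothetical helper: return a float array, or None if the
    input contains NaN or an infinite value."""
    try:
        return np.asarray_chkfinite(values, dtype=float)
    except ValueError:
        return None

ok = to_finite_array([11, 12, 13, 14, 15])
with_nan = to_finite_array([11, 12, 13, np.nan, 15])
with_inf = to_finite_array([11, 12, 13, 14, np.inf])

print(ok)
print(with_nan, with_inf)
```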

np.zeros(), np.ones(), and np.empty() create an array with the number of elements given in the parentheses, filled with 0s, filled with 1s, or left uninitialized, respectively.

# making arrays of 0's or 1's
# np.zeros : Produce an array of all 0's with the given shape and dtype

In [31]: np.zeros(5) # an array of 5 zeros

Out[31]: array([ 0., 0., 0., 0., 0.])

# np.ones : Produce an array of all 1's with the given shape and dtype

In [32]: np.ones(10) # an array of 10 ones

Out[32]: array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

# making new arrays by allocating new memory,
# but do not populate with any values like ones and zeros

In [33]: np.empty(10) # an array of 10 uninitialized (arbitrary) values

Out[33]: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Passing a tuple, as in np.zeros((2, 5)), np.ones((5, 3)), or np.empty((4, 3)), creates a multi-dimensional array.

# multi-dimensional array by passing a tuple for the shape

In [34]: np.zeros((2, 5)) # 2 by 5 array with '0' elements

Out[34]:

array([[ 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0.]])

In [35]: np.ones((5, 3)) # 5 by 3 array with '1' elements

Out[35]:

array([[ 1., 1., 1.],
       [ 1., 1., 1.],
       [ 1., 1., 1.],
       [ 1., 1., 1.],
       [ 1., 1., 1.]])

In [36]: np.empty((4, 3))

Out[36]:

array([[ 9.88131292e-324,   0.00000000e+000,   0.00000000e+000],
       [ 0.00000000e+000,   0.00000000e+000,   0.00000000e+000],
       [ 0.00000000e+000,   0.00000000e+000,   0.00000000e+000],
       [ 0.00000000e+000,   0.00000000e+000,   0.00000000e+000]])
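As Out[36] shows, np.empty() only reserves memory, so whatever values happen to be there show up in the output. A typical pattern is to allocate with np.empty() and then overwrite every element before reading anything. The sketch below shows that pattern, plus np.full() for the case where one constant fill value is wanted from the start; the variable names are illustrative.

```python
import numpy as np

# Allocate without initializing; the contents are arbitrary at this point.
buffer = np.empty((4, 3))

# Overwrite every row before using the array.
for i in range(buffer.shape[0]):
    buffer[i] = i  # the scalar is broadcast across the row

print(buffer)

# np.full creates an array already filled with a chosen constant.
sevens = np.full((2, 5), 7)
print(sevens)
```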

Like Python's range() function, np.arange() produces the integers from 0 up to (but not including) the number in the parentheses, but it returns them as an array (returns not a list, but an array). This function is also used fairly often.

# np.arange : an array-valued version of the built-in Python range function
# Like the built-in range but returns an ndarray instead of a list

In [38]: f = np.arange(10)

In [39]: f

Out[39]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
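np.arange() also accepts start, stop, and step arguments in the same order as range(), and unlike range() the step may be a float. A short sketch with illustrative variable names:

```python
import numpy as np

evens = np.arange(2, 10, 2)      # start=2, stop=10 (excluded), step=2
countdown = np.arange(5, 0, -1)  # a negative step counts down
halves = np.arange(0, 2, 0.5)    # a float step gives a float array

# arange is often combined with reshape to build a small test matrix.
grid = np.arange(15).reshape(5, 3)

print(evens, countdown, halves)
print(grid.shape)
```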

np.zeros_like(), np.ones_like(), and np.empty_like() return an array of 0s, 1s, or uninitialized values, respectively, with the same shape and data type as an existing array.

# np.zeros_like(), np.ones_like(), np.empty_like()
# Return an array of ones or zeros with the same shape and type as a given array

In [40]: f_2_5 = f.reshape(2, 5)

In [41]: f_2_5

Out[41]:

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

# return an array of zeros with the same shape and type

In [42]: np.zeros_like(f_2_5)

Out[42]:

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

# return an array of ones with the same shape and type

In [43]: np.ones_like(f_2_5)

Out[43]:

array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

# return an array of uninitialized values with the same shape and type

In [44]: np.empty_like(f_2_5)

Out[44]:

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])
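Note that the outputs above are integer arrays, because the *_like functions copy the dtype of the template array as well as its shape. If a different element type is needed, the dtype argument overrides it. A minimal sketch with illustrative names:

```python
import numpy as np

template = np.arange(10).reshape(2, 5)  # an integer array

# Same shape and same (integer) dtype as the template.
zeros_same = np.zeros_like(template)

# Same shape, but the dtype is overridden to float.
ones_float = np.ones_like(template, dtype=float)

print(zeros_same.dtype == template.dtype, zeros_same.shape)
print(ones_float.dtype, ones_float.shape)
```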

np.identity() or np.eye() creates an identity matrix (also called a unit matrix): a square matrix with 1s on the diagonal and 0s elsewhere.

For special kinds of matrices, see the post below.

☞ http://rfriend.tistory.com/141

# np.identity, np.eye : making a square N x N identity matrix, unit matrix
# identity matrix : 1's on the diagonal and 0's elsewhere

In [45]: np.identity(5)

Out[45]:

array([[ 1., 0., 0., 0., 0.],
       [ 0., 1., 0., 0., 0.],
       [ 0., 0., 1., 0., 0.],
       [ 0., 0., 0., 1., 0.],
       [ 0., 0., 0., 0., 1.]])

In [46]: np.eye(5)

Out[46]:

array([[ 1., 0., 0., 0., 0.],
       [ 0., 1., 0., 0., 0.],
       [ 0., 0., 1., 0., 0.],
       [ 0., 0., 0., 1., 0.],
       [ 0., 0., 0., 0., 1.]])
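The two functions give the same result for a square matrix, but np.eye() is a little more flexible: it can build a non-square matrix and can shift the diagonal with the k argument. The sketch below shows both, and also checks the defining property that multiplying by the identity matrix leaves a matrix unchanged. The variable names are illustrative.

```python
import numpy as np

identity = np.eye(3)

# np.eye accepts a column count, so the result need not be square.
rect = np.eye(2, 3)

# k shifts the diagonal: k=1 puts the ones just above the main diagonal.
upper = np.eye(3, k=1)

# Multiplying by the identity matrix leaves a matrix unchanged.
m = np.arange(9, dtype=float).reshape(3, 3)
unchanged = m @ identity

print(rect)
print(upper)
print(np.array_equal(unchanged, m))
```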

That concludes this introduction to creating ndarrays with the Python NumPy module.

I hope this was helpful.
