R, Python 분석과 프로그래밍의 친구 (by R Friend)

'split'에 해당되는 글 2건

[PyTorch] 텐서 나누기 (splitting a PyTorch tensor into multiple tensors)

Deep Learning (TF, Keras, PyTorch)/PyTorch basics 2023. 2. 23. 00:33

지난 포스팅에서는 두 개의 PyTorch 텐서를 합치기 (concatenating two PyTorch tensors) 에 대해서 다루었습니다.
(바로가기 ==> https://rfriend.tistory.com/781 )

이번 포스팅에서는 반대로 한 개의 PyTorch 텐서를 복수 개로 나누기 (splitting a tensor into multiple tensors) 하는 방법을 소개하겠습니다.

(1) 하나의 PyTorch 텐서를 위-아래의 복수 개의 텐서로 나누기
(splitting a tensor into multiple tensors vertically)

: torch.vsplit(tensor, indices_or_sections), torch.split(tensor, split_size_or_sections, dim=0)

(2) 하나의 PyTorch 텐서를 좌-우의 복수 개의 텐서로 나누기
(splitting a tensor into multiple tensors horizontally)

: torch.hsplit(tensor, indices_or_sections), torch.split(tensor, split_size_or_sections, dim=1)

먼저 PyTorch 텐서를 위-아래의 수직으로 복수 개의 텐서로 나누기를 해보겠습니다.

(1) 하나의 PyTorch 텐서를 위-아래의 복수 개의 텐서로 나누기
(splitting a tensor into multiple tensors vertically)
: torch.vsplit(tensor, indices_or_sections), torch.split(tensor, split_size_or_sections, dim=0)

torch.vsplit() 과 torch.split(dim=0) 의 경우 매개변수의 사용법에 차이가 있습니다.

torch.vsplit(tensor, indices_or_sections) 은 indices_or_sections 매개변수로 input tensor 의 행(row) 을 몇으로 나눌지(indices)를, 그래서 결국 몇 개의 텐서로 나누고 싶은지를 지정해주는 것입니다. 가령, 아래 예의 경우 텐서 z 는 Size[6, 6] 인데요, indices_or_sections=2 로 입력할 경우, 6/2 = 3 으로서 각 3개의 균등한 행을 가지는 2개의 텐서로 나누어줍니다.

반면에, torch.split(tensor, split_size_or_sections, dim=0) 은 (a) "dim=0" 으로 행에 대해서 수직으로 텐서 분리를 수행하라고 지정을 해주어야 하고, (b) split_size_or_sections 에서 지정한 숫자만큼의 크기(split_size)로 행을 가지도록 텐서를 분리해줍니다. 가령, 아래 예의 Size[6, 6] 의 텐서 z에 대해서 torch.split(z, 3, dim=0) 으로서 split_size_or_sections = 3 을 입력해주면 각 행을 3개씩의 크기(split size)로 가지는 텐서로 분리를 해줍니다. (원래 텐서에 행이 6개 있는데, 분리할 텐서는 각 3개씩 행을 가지라고 했으므로 결과적으로 총 2개의 텐서로 분리가 됨).

PyTorch: splitting a tensor into multiple tensors vertically

예제로 사용할 Size[6, 6]의 PyTorch 텐서 z 를 만들어보겠습니다.

z = torch.arange(36).reshape(6,6)

print(z)
# tensor([[ 0,  1,  2,  3,  4,  5],
#         [ 6,  7,  8,  9, 10, 11],
#         [12, 13, 14, 15, 16, 17],
#         [18, 19, 20, 21, 22, 23],
#         [24, 25, 26, 27, 28, 29],
#         [30, 31, 32, 33, 34, 35]])

텐서 z 를 torch.vsplit(z, 2) 를 사용해서 2개의 텐서로 분리를 해보겠습니다. (indices_or_sections = 2 로 입력하면 6을 2로 나누어서 3개씩의 행을 가지는 2개의 텐서로 분리해줌)

## vsplit(): Splits input, a tensor with two or more dimensions, 
## into multiple tensors vertically according to indices_or_sections.
torch.vsplit(z, 2)

# (tensor([[ 0,  1,  2,  3,  4,  5],
#          [ 6,  7,  8,  9, 10, 11],
#          [12, 13, 14, 15, 16, 17]]),
#  tensor([[18, 19, 20, 21, 22, 23],
#          [24, 25, 26, 27, 28, 29],
#          [30, 31, 32, 33, 34, 35]]))

이때 반환되는 아웃풋은 두 개의 텐서를 묶어놓은 튜플(tuple)입니다.

type(torch.vsplit(z, 2))
# tuple

위에서 2개로 분리한 텐서들의 묶음인 튜플에 원하는 부분의 텐서에 접근하기 위해서는 인덱싱(indexing)을 사용하면 됩니다. 아래 예에서는 2개로 분리한 텐서의 각 첫번째와 두번째 튜플에 접근해서 가져와봤습니다.

## accessing a tensor after splitting
torch.vsplit(z, 2)[0]

# tensor([[ 0,  1,  2,  3,  4,  5],
#         [ 6,  7,  8,  9, 10, 11],
#         [12, 13, 14, 15, 16, 17]])


torch.vsplit(z, 2)[1]

# tensor([[18, 19, 20, 21, 22, 23],
#         [24, 25, 26, 27, 28, 29],
#         [30, 31, 32, 33, 34, 35]])

torch.split(z, 3, dim=0) 은 dim=0 으로 지정을 해주면 됩니다.

## Splits the tensor into chunks. 
## Each chunk is a view of the original tensor.
torch.split(z, 3, dim=0)

# (tensor([[ 0,  1,  2,  3,  4,  5],
#          [ 6,  7,  8,  9, 10, 11],
#          [12, 13, 14, 15, 16, 17]]),
#  tensor([[18, 19, 20, 21, 22, 23],
#          [24, 25, 26, 27, 28, 29],
#          [30, 31, 32, 33, 34, 35]]))

## split_size_or_sections : list of sizes for each chunk
torch.split(z, [1,2,3], dim=0)

# (tensor([[0, 1, 2, 3, 4, 5]]),
#  tensor([[ 6,  7,  8,  9, 10, 11],
#          [12, 13, 14, 15, 16, 17]]),
#  tensor([[18, 19, 20, 21, 22, 23],
#          [24, 25, 26, 27, 28, 29],
#          [30, 31, 32, 33, 34, 35]]))

(2) 하나의 PyTorch 텐서를 좌-우의 복수 개의 텐서로 나누기
(splitting a tensor into multiple tensors horizontally)

: torch.hsplit(tensor, indices_or_sections), torch.split(tensor, split_size_or_sections, dim=1)

이번에는 하나의 PyTorch 텐서를 좌-우 수평으로해서 복수 개의 텐서로 나누어볼텐데요, 역시 torch.hsplit() 과 torch.split(dim=1) 의 경우 매개변수의 사용법에 차이가 있습니다.

torch.hsplit(tensor, indices_or_sections) 은 indices_or_sections 매개변수로 input tensor 의 열(column) 을 몇으로 나눌지(indices)를, 그래서 결국 몇 개의 텐서로 나누고 싶은지를 지정해주는 것입니다. 가령, 아래 예의 경우 텐서 z 는 Size[6, 6] 인데요, indices_or_sections=2 로 입력할 경우, 6/2 = 3 으로서 각 3개의 균등한 열을 가지는 2개의 텐서로 좌-우로 나누어줍니다.

반면에, torch.split(tensor, split_size_or_sections, dim=1) 은 (a) "dim=1" 으로 행에 대해서 수평으로 텐서 분리를 수행하라고 지정을 해주어야 하고, (b) split_size_or_sections 에서 지정한 숫자만큼의 크기(split_size)로 행을 가지도록 텐서를 분리해줍니다. 가령, 아래 예의 Size[6, 6] 의 텐서 z에 대해서 torch.split(z, 3, dim=0) 으로서 split_size_or_sections = 3 을 입력해주면 각 열(column)을 3개씩의 크기(split size)로 가지는 텐서로 분리를 해줍니다. (원래 텐서에 열이 6개 있는데, 분리할 텐서는 각 3개씩 열을 가지라고 했으므로 결과적으로 총 2개의 텐서로 분리가 됨).

PyTorch hsplit(), split(dim=1) : splitting a tensor into multiple tensors horizontally

## hsplit(): Splits input, a tensor with two or more dimensions, 
## into multiple tensors horizontally according to indices_or_sections.

torch.hsplit(z, 2)

# (tensor([[ 0,  1,  2],
#          [ 6,  7,  8],
#          [12, 13, 14],
#          [18, 19, 20],
#          [24, 25, 26],
#          [30, 31, 32]]),
#  tensor([[ 3,  4,  5],
#          [ 9, 10, 11],
#          [15, 16, 17],
#          [21, 22, 23],
#          [27, 28, 29],
#          [33, 34, 35]]))

torch.hsplit(tensor, indices_or_sections) 에서 indices_or_sections 매개변수로 indices 넣어줄 때는 정수로 나누어지는 값을 넣어주어야 합니다. 가령, 아래 예에서는 Size[6, 6] 의 텐서에서 dimension 1 의 방향으로 4개 나누라고 지정해었더닌 6을 4로 나눌 수 없다면서 RunTimeError: torch.hsplit attempted to split along dimension 1, but size of the dimension 6 is not divisible by the split_size 4! 라는 에러가 발생했습니다.

## RuntimeError
torch.hsplit(z, 4)

# RuntimeError: torch.hsplit attempted to split along dimension 1, 
# but the size of the dimension 6 is not divisible by the split_size 4!

torch.split(z, 3, dim=1) 에서는 dim=1 로 차원을 지정해주면 됩니다.

torch.split(z, 3, dim=1)

# (tensor([[ 0,  1,  2],
#          [ 6,  7,  8],
#          [12, 13, 14],
#          [18, 19, 20],
#          [24, 25, 26],
#          [30, 31, 32]]),
#  tensor([[ 3,  4,  5],
#          [ 9, 10, 11],
#          [15, 16, 17],
#          [21, 22, 23],
#          [27, 28, 29],
#          [33, 34, 35]]))

torch.split() 함수에서 split_size_or_sections 매개변수에 리스트로 해서 나누고 싶은 텐서의 크기 (split_size_sections) 를 복수개로 지정해줄 수도 있습니다. 아래의 예에서는 Size[6, 6]의 텐서를 열의 개수를 1개, 2개, 3개를 가지는 총 3개의 텐서로 분리(split_size_or_sections = [1, 2, 3])해 본 것입니다. 매우 편리한 기능입니다!

## split_size_or_sections : list of sizes for each chunk
torch.split(z, [1,2,3], dim=1)

# (tensor([[ 0],
#          [ 6],
#          [12],
#          [18],
#          [24],
#          [30]]),
#  tensor([[ 1,  2],
#          [ 7,  8],
#          [13, 14],
#          [19, 20],
#          [25, 26],
#          [31, 32]]),
#  tensor([[ 3,  4,  5],
#          [ 9, 10, 11],
#          [15, 16, 17],
#          [21, 22, 23],
#          [27, 28, 29],
#          [33, 34, 35]]))

이번 포스팅이 많은 도움이 되었기를 바랍니다.

행복한 데이터 과학자 되세요! :-)

728x90

저작자표시 비영리 변경금지

'Deep Learning (TF, Keras, PyTorch) > PyTorch basics' 카테고리의 다른 글

딥러닝에서 오차역전파법 (Backpropagation, Backward Propagation of Errors) 이란? (0)	2023.12.12
[PyTorch] torchvision.datasets 모듈에 내장되어 있는 데이터 가져와서 시각화하기 (0)	2023.02.26
[PyTorch] 텐서 합치기 (concat, stack) (0)	2023.02.21
[PyTorch] 텐서의 인덱싱과 슬라이싱 (indexing & slicing of PyTorch tensor) (0)	2023.02.19
[PyTorch] NumPy의 array 대비 PyTorch 의 성능 비교 (0)	2023.02.19

Posted by Rfriend

[Python pandas] groupby() 로 그룹별 집계하기 (data aggregation by groups)

Python 분석과 프로그래밍/Python 데이터 전처리 2018. 8. 26. 18:20

이번 포스팅에서는 Python pandas의 groupby() 연산자를 사용하여 집단, 그룹별로 데이터를 집계, 요약하는 방법을 소개하겠습니다.

전체 데이터를 그룹 별로 나누고 (split), 각 그룹별로 집계함수를 적용(apply) 한후, 그룹별 집계 결과를 하나로 합치는(combine) 단계를 거치게 됩니다. (Split => Apply function => Combine)

[ GroupBy aggregation mechanics ]

groupby() 는 다양한 변수를 가진 데이터셋을 분석하는데 있어 그룹별로 데이터를 집계하는 분석은 일상적으로 이루어지는 만큼 사용빈도가 매우 높고 알아두면 유용합니다.

실습에 사용할 예제는 바다 해산물인 전복(abalone)에 대한 공개 데이터셋을 사용하겠습니다.

[ UCI Machine Learning Repository ]

Abalone Data Set 설명: http://archive.ics.uci.edu/ml/datasets/Abalone
Abalone Data Set: http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data

Abalone CSV dataset download: abalone.txt

Variables

Name Data Type Meas. Description ---- --------- ----- ----------- Sex nominal M, F, and I (infant) Length continuous mm Longest shell measurement Diameter continuous mm perpendicular to length Height continuous mm with meat in shell Whole weight continuous grams whole abalone Shucked weight continuous grams weight of meat Viscera weight continuous grams gut weight (after bleeding) Shell weight continuous grams after being dried Rings integer +1.5 gives the age in years

먼저, 바로 위에 링크해놓은 abalone.txt를 다운받은 후에, abalone.txt 데이터셋을 pandas의 read_csv() 로 불러와서 DataFrame을 만들어보겠습니다.

# Importing common libraries

import pandas as pd

from pandas import DataFrame

from pandas import Series

import numpy as np

# Reading abalone data set

abalone = pd.read_csv("/Users/ihongdon/Documents/Python/abalone.txt",

sep=",",

names = ['sex', 'length', 'diameter', 'height',

'whole_weight', 'shucked_weight', 'viscera_weight',

'shell_weight', 'rings'],

header = None)

abalone 라는 이름의 pandas DataFrame을 만들었으니, 데이터가 어떻게 생겼는지 탐색해보겠습니다. 다행히 결측치는 없으며, 4,177개의 관측치를 가지고 있네요. 전복의 성별(sex) 변수가 범주형 변수입니다.

# View of top 5 observations

abalone.head()

sex length diameter height whole_weight shucked_weight viscera_weight shell_weight rings

0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15

1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7

2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9

3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10

4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7

# Check the missing value

np.sum(pd.isnull(abalone))

sex               0
length            0
diameter          0
height            0
whole_weight      0
shucked_weight    0
viscera_weight    0
shell_weight      0
rings             0
dtype: int64

# Descriptive statics

abalone.describe()

	length	diameter	height	whole_weight	shucked_weight	viscera_weight	shell_weight	rings
count	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000
mean	0.523992	0.407881	0.139516	0.828742	0.359367	0.180594	0.238831	9.933684
std	0.120093	0.099240	0.041827	0.490389	0.221963	0.109614	0.139203	3.224169
min	0.075000	0.055000	0.000000	0.002000	0.001000	0.000500	0.001500	1.000000
25%	0.450000	0.350000	0.115000	0.441500	0.186000	0.093500	0.130000	8.000000
50%	0.545000	0.425000	0.140000	0.799500	0.336000	0.171000	0.234000	9.000000
75%	0.615000	0.480000	0.165000	1.153000	0.502000	0.253000	0.329000	11.000000
max	0.815000	0.650000	1.130000	2.825500	1.488000	0.760000	1.005000	29.000000

자, 데이터 준비가 되었으니 이제부터 '전복 성별(sex)' 그룹('F', 'M', 'I')별로 전복의 전체 무게('whole_weight') 변수에 대해서 GroupBy 집계를 해보겠습니다.

집단별 크기는 grouped.size(), 집단별 합계는 grouped.sum(), 집단별 평균은 grouped.mean() 을 사용합니다.

grouped = abalone['whole_weight'].groupby(abalone['sex'])

grouped

<pandas.core.groupby.SeriesGroupBy object at 0x112668c10>

grouped.size()

sex
F    1307
I    1342
M    1528
Name: whole_weight, dtype: int64

grouped.sum()

sex
F    1367.8175
I     578.8885
M    1514.9500
Name: whole_weight, dtype: float64

grouped.mean()

sex
F    1.046532
I    0.431363
M    0.991459
Name: whole_weight, dtype: float64

위의 예에서는 'whole_weight' 라는 하나의 연속형 변수에 대해서만 '성별(sex)' 집계를 하였습니다만, 집계를 하는 key를 제외한 데이터프레임 안의 전체 연속형 변수에 대해서 한꺼번에 집계를 할 수도 있습니다.

abalone.groupby(abalone['sex']).mean()

	length	diameter	height	whole_weight	shucked_weight	viscera_weight	shell_weight	rings
sex
F	0.579093	0.454732	0.158011	1.046532	0.446188	0.230689	0.302010	11.129304
I	0.427746	0.326494	0.107996	0.431363	0.191035	0.092010	0.128182	7.890462
M	0.561391	0.439287	0.151381	0.991459	0.432946	0.215545	0.281969	10.705497

DataFrame.groupby('key_var').func() 형식으로도 사용가능하며, 위의 abalone.groupby(abalone['sex']).mean()은 아래처럼 abalone.groupby('sex').mean() 처럼 써도 똑같은 결과를 얻을 수 있습니다.

abalone.groupby('sex').mean()

	length	diameter	height	whole_weight	shucked_weight	viscera_weight	shell_weight	rings
sex
F	0.579093	0.454732	0.158011	1.046532	0.446188	0.230689	0.302010	11.129304
I	0.427746	0.326494	0.107996	0.431363	0.191035	0.092010	0.128182	7.890462
M	0.561391	0.439287	0.151381	0.991459	0.432946	0.215545	0.281969	10.705497

이제부터는 '성별(sex)' 이외에 '길이(length)'를 가지고 범주형 변수를 하나 더 만들어서, 2개의 범주형 변수 key 값을 가지고 그룹별 집계를 해보겠습니다.

np.where() 함수를 사용하여 length 의 중앙값보다 크면 'length_long'으로, 중앙값보다 작으면 'length_short'의 이름으로하는 계급으로하는 새로운 범주형 변수를 만들어보겠습니다.

abalone['length_cat'] = np.where(abalone.length > np.median(abalone.length),

'length_long', # True

'length_short') # False

abalone[['length', 'length_cat']][:10]

length length_cat

0 0.455 length_short

1 0.350 length_short

2 0.530 length_short

3 0.440 length_short

4 0.330 length_short

5 0.425 length_short

6 0.530 length_short

7 0.545 length_short

8 0.475 length_short

9 0.550 length_long

그럼, 이제 성별 그룹(sex)과 길이 범주(length_cat) 그룹별로 GroupBy 를 사용하여 평균을 구해보겠습니다.

mean_by_sex_length = abalone['whole_weight'].groupby([abalone['sex'], abalone['length_cat']]).mean()

mean_by_sex_length

sex  length_cat  
F    length_long     1.261330
     length_short    0.589702
I    length_long     0.923215
     length_short    0.351234
M    length_long     1.255182
     length_short    0.538157
Name: whole_weight, dtype: float64

위의 집계 결과가 SQL로 집계했을 때의 형태로 결과가 제시가 되었는데요, unstack() 함수를 사용하면 집계 결과를 가로, 세로 축으로 좀더 보기에 좋게 표현을 할 수 있습니다.

mean_by_sex_length.unstack()

length_cat	length_long	length_short
sex
F	1.261330	0.589702
I	0.923215	0.351234
M	1.255182	0.538157

abalone['whole_weight'].groupby([abalone['sex'], abalone['length_cat']]).mean() 를 좀더 간결하게 아래처럼 쓸 수도 있습니다. 대상 데이터프레임을 제일 앞에 써주고, groupby()에 집계의 기준이 되는 key 변수들을 써주고, 제일 뒤에 집계하려는 연속형 변수이름을 써주었습니다.

abalone.groupby(['sex', 'length_cat'])['whole_weight'].mean()

sex  length_cat  
F    length_long     1.261330
     length_short    0.589702
I    length_long     0.923215
     length_short    0.351234
M    length_long     1.255182
     length_short    0.538157
Name: whole_weight, dtype: float64

많은 도움이 되었기를 바랍니다.

728x90

저작자표시 비영리 변경금지

'Python 분석과 프로그래밍 > Python 데이터 전처리' 카테고리의 다른 글

[Python pandas] 범주형 변수의 항목을 기준 정보를 사용하여 매핑해 변환하기: dict.get() (0)	2018.08.31
[Python pandas] GroupBy로 그룹별로 반복 작업하기 (Iteration over groups) (0)	2018.08.26
[Python pandas] 다수개의 범주형자료로 가변수 만들기 (dummy variable) (2)	2018.08.21
[Python NumPy] 선형대수 함수 (Linear Algebra) (0)	2018.08.15
[Python] numpy 배열을 여러개의 하위 배열로 분할하기 (split an array into sub-arrays) (0)	2018.05.22

Posted by Rfriend

이전 1 다음

R, Python 분석과 프로그래밍의 친구 (by R Friend)

'split'에 해당되는 글 2건

[PyTorch] 텐서 나누기 (splitting a PyTorch tensor into multiple tensors)

(1) 하나의 PyTorch 텐서를 위-아래의 복수 개의 텐서로 나누기
(splitting a tensor into multiple tensors vertically)
: torch.vsplit(tensor, indices_or_sections), torch.split(tensor, split_size_or_sections, dim=0)

(2) 하나의 PyTorch 텐서를 좌-우의 복수 개의 텐서로 나누기
(splitting a tensor into multiple tensors horizontally)

'Deep Learning (TF, Keras, PyTorch) > PyTorch basics' 카테고리의 다른 글

[Python pandas] groupby() 로 그룹별 집계하기 (data aggregation by groups)

'Python 분석과 프로그래밍 > Python 데이터 전처리' 카테고리의 다른 글

카테고리

태그목록

티스토리툴바

R, Python 분석과 프로그래밍의 친구 (by R Friend)

'split'에 해당되는 글 2건

[PyTorch] 텐서 나누기 (splitting a PyTorch tensor into multiple tensors)

(1) 하나의 PyTorch 텐서를 위-아래의 복수 개의 텐서로 나누기 (splitting a tensor into multiple tensors vertically) : torch.vsplit(tensor, indices_or_sections), torch.split(tensor, split_size_or_sections, dim=0)

(2) 하나의 PyTorch 텐서를 좌-우의 복수 개의 텐서로 나누기 (splitting a tensor into multiple tensors horizontally)

'Deep Learning (TF, Keras, PyTorch) > PyTorch basics' 카테고리의 다른 글

[Python pandas] groupby() 로 그룹별 집계하기 (data aggregation by groups)

'Python 분석과 프로그래밍 > Python 데이터 전처리' 카테고리의 다른 글

카테고리

태그목록

티스토리툴바

(1) 하나의 PyTorch 텐서를 위-아래의 복수 개의 텐서로 나누기
(splitting a tensor into multiple tensors vertically)
: torch.vsplit(tensor, indices_or_sections), torch.split(tensor, split_size_or_sections, dim=0)

(2) 하나의 PyTorch 텐서를 좌-우의 복수 개의 텐서로 나누기
(splitting a tensor into multiple tensors horizontally)