[Python pandas] Iterating over the rows, columns, and row tuples of a DataFrame (iterate over pandas DataFrame rows, columns, tuple(index, row, columns))

Python Analysis and Programming / Python Data Preprocessing, 2021. 1. 10. 19:25

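In this post, I'll show how to use three pandas methods, DataFrame.iterrows(), DataFrame.iteritems(), and DataFrame.itertuples(), together with a for loop to iterate over the rows, columns, and row tuples of a pandas DataFrame and return their data. Note that DataFrame.iteritems() was deprecated in pandas 1.5 and removed in pandas 2.0; on current versions, use DataFrame.items(), which behaves the same way.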

(1) pd.DataFrame.iterrows() : iterate over rows
(Iterate over DataFrame rows as (index, Series) pairs.)

(2) pd.DataFrame.iteritems() : iterate over columns (DataFrame.items() in pandas 2.0 and later)
(Iterate over DataFrame (column name, Series) pairs.)

(3) pd.DataFrame.itertuples() : iterate over rows as named tuples (index, column values)

(Iterate over DataFrame rows as namedtuples)

[ Iterating over the rows, columns, and row tuples of a pandas DataFrame ]

(1) DataFrame.iterrows() : iterate over rows
(Iterate over DataFrame rows as (index, Series) pairs.)

First, let's import pandas and create a simple example DataFrame with two columns and a labeled index.

import pandas as pd

df = pd.DataFrame(
    {'price': [100, 200, 300],
     'weight': [20.3, 15.1, 25.9]},
    index=['idx_a', 'idx_b', 'idx_c'])

df

	price	weight
idx_a	100	20.3
idx_b	200	15.1
idx_c	300	25.9

Now let's use DataFrame.iterrows() with a for loop to iterate over the rows, printing the index label and the value of every column for each row.

## DataFrame.iterrows()
for idx, row in df.iterrows():
    print("** index name:", idx)
    print(row)
    print("------"*5)

[Out]
** index name: idx_a
price     100.0
weight     20.3
Name: idx_a, dtype: float64
------------------------------
** index name: idx_b
price     200.0
weight     15.1
Name: idx_b, dtype: float64
------------------------------
** index name: idx_c
price     300.0
weight     25.9
Name: idx_c, dtype: float64
------------------------------

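If the DataFrame has several columns and you only want the value of one particular column while iterating over the rows, index the row Series by column name with row['column_name'] or by integer position with row.iloc[position_int]. (A plain row[position_int] on a label-indexed Series is deprecated in recent pandas versions, so .iloc is the safer form.)

## accessing a column of each row by indexing
for idx, row in df.iterrows():
    print(idx)
    print(row['price']) # or print(row.iloc[0])
    print("-----")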

[Out]
idx_a
100.0
-----
idx_b
200.0
-----
idx_c
300.0
-----
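The two access styles described above can be compared side by side. The following is a minimal sketch that rebuilds the example DataFrame and collects the 'price' value of each row into a dict keyed by index label; the names price_by_label and price_by_position are illustrative and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame(
    {'price': [100, 200, 300],
     'weight': [20.3, 15.1, 25.9]},
    index=['idx_a', 'idx_b', 'idx_c'])

# Illustrative containers: one 'price' value per index label.
price_by_label = {}
price_by_position = {}

for idx, row in df.iterrows():
    price_by_label[idx] = row['price']     # access by column name
    price_by_position[idx] = row.iloc[0]   # access by integer position

print(price_by_label)
print(price_by_label == price_by_position)
```

Both dicts hold the same values, since 'price' is the first column of the example DataFrame.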

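DataFrame.iterrows() returns (index, Series) pairs. Because each row is packed into a single Series, the dtypes of the original DataFrame are not preserved across the row, so a value in the row Series can have a different dtype than its source column.

For example, the 'price' column of the example DataFrame is integer (int64), whereas row['price'] returned by df.iterrows() has become floating point (float64).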

## DataFrame.iterrows() returns a Series for each row,
## it does not preserve dtypes across the rows.
print('Data type of df price:', df['price'].dtype) # int
print('Data type of row price:', row['price'].dtype) # float

[Out]
Data type of df price: int64
Data type of row price: float64
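To see where the type change comes from, it may help to compare a frame whose columns are all integers with one that mixes integers and floats. The sketch below uses two small hypothetical DataFrames (df_int and df_mixed, with a made-up 'stock' column) and inspects the dtype of the first row Series of each.

```python
import pandas as pd

# Hypothetical all-integer frame: a row Series can stay integer.
df_int = pd.DataFrame({'price': [100, 200], 'stock': [5, 7]})

# Mixed integer/float frame: a row Series needs one common dtype,
# so the integers are upcast to float.
df_mixed = pd.DataFrame({'price': [100, 200], 'weight': [20.3, 15.1]})

_, row_int = next(df_int.iterrows())
_, row_mixed = next(df_mixed.iterrows())

print(row_int.dtype)
print(row_mixed.dtype)
```

The upcast only happens when the columns of a row do not already share a dtype, which is why the 'price' values in the example above came back as floats.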

(2) DataFrame.iteritems() : iterate over columns
(Iterate over DataFrame (column name, Series) pairs.)

Where (1) iterated over the rows of the DataFrame, this time we iterate over its columns with a for loop, printing each 'column name' paired with that column's values as a Series (one value per row). DataFrame.iteritems() was removed in pandas 2.0, so the code below uses DataFrame.items(), which returns the same (column name, Series) pairs.

	price	weight
idx_a	100	20.3
idx_b	200	15.1
idx_c	300	25.9

for col, item in df.items(): # df.iteritems() in pandas < 2.0
    print("** column name:", col)
    print(item) # = print(item, sep='\n')
    print("-----"*5)

[Out]
** column name: price
idx_a    100
idx_b    200
idx_c    300
Name: price, dtype: int64
-------------------------
** column name: weight
idx_a    20.3
idx_b    15.1
idx_c    25.9
Name: weight, dtype: float64
-------------------------

If you only want the value of one particular row while iterating over the columns, index item by the row's integer position with item.iloc[position_int] or by the row's index label with item['index_name'].

for col, item in df.items(): # df.iteritems() in pandas < 2.0
    print(col)
    print(item.iloc[0]) # = print(item['idx_a'])

[Out]
price
100
weight
20.3
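Position-based and label-based row selection inside a column loop can be checked against each other. This is a minimal sketch using DataFrame.items() on the example DataFrame; the dict names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {'price': [100, 200, 300],
     'weight': [20.3, 15.1, 25.9]},
    index=['idx_a', 'idx_b', 'idx_c'])

# Illustrative containers: first-row value of every column.
first_by_position = {}
first_by_label = {}

for col, item in df.items():
    first_by_position[col] = item.iloc[0]      # row by integer position
    first_by_label[col] = item.loc['idx_a']    # row by index label

print(first_by_position)
print(first_by_position == first_by_label)
```

Using .iloc and .loc explicitly keeps the intent clear when the index labels are themselves integers.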

(3) DataFrame.itertuples() : iterate over rows as named tuples (index, column values)

(Iterate over DataFrame rows as namedtuples)

In (1), DataFrame.iterrows() iterated over the rows of the DataFrame, and in (2), DataFrame.iteritems() iterated over its columns. When you instead want each row as a single tuple that bundles the index with the values of every column, you can use DataFrame.itertuples().

Without it, you would have to pull out the index and each column separately and zip() them together yourself. DataFrame.itertuples() does this in one step, so it is a very handy method to know.

The example below uses DataFrame.itertuples() with a for loop to iterate over the rows of 'df' as namedtuples of the form (Index, price, weight) and print each one.

	price	weight
idx_a	100	20.3
idx_b	200	15.1
idx_c	300	25.9

for row in df.itertuples():
    print(row)

[Out] 
Pandas(Index='idx_a', price=100, weight=20.3)
Pandas(Index='idx_b', price=200, weight=15.1)
Pandas(Index='idx_c', price=300, weight=25.9)
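The zip() bookkeeping mentioned earlier can be written out and compared with what itertuples() produces. The sketch below assumes plain tuples are acceptable and therefore passes name=None, which makes itertuples() yield regular tuples instead of namedtuples.

```python
import pandas as pd

df = pd.DataFrame(
    {'price': [100, 200, 300],
     'weight': [20.3, 15.1, 25.9]},
    index=['idx_a', 'idx_b', 'idx_c'])

# Manual approach: zip the index with each column by hand.
manual = list(zip(df.index, df['price'], df['weight']))

# itertuples(name=None) yields plain tuples with the index first.
plain = list(df.itertuples(name=None))

print(manual)
print(manual == plain)
```

The manual version has to name every column explicitly, whereas itertuples() picks up all columns automatically.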

If you don't want the index included, set the parameter index=False.

## By setting index=False, we can remove the index as the first element of the tuple.
for row in df.itertuples(index=False):
    print(row)

[Out] 
Pandas(price=100, weight=20.3)
Pandas(price=200, weight=15.1)
Pandas(price=300, weight=25.9)

I mentioned that DataFrame.itertuples() returns named tuples (namedtuples); the name parameter sets the type name of those tuples. In the example below, name='Product' gives the tuples the name 'Product'.

## Setting a custom name for the yielded namedtuples.
for row in df.itertuples(name='Product'):
    print(row)

[Out]
Product(Index='idx_a', price=100, weight=20.3)
Product(Index='idx_b', price=200, weight=15.1)
Product(Index='idx_c', price=300, weight=25.9)
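Since each row is a namedtuple, its values can be read by attribute, by position, or converted to a dict. The sketch below also shows one caveat from the pandas documentation: column names that are not valid Python identifiers are renamed to positional names. The df_spaces frame and its 'unit price' column are hypothetical, introduced only to illustrate that renaming.

```python
import pandas as pd

df = pd.DataFrame(
    {'price': [100, 200, 300],
     'weight': [20.3, 15.1, 25.9]},
    index=['idx_a', 'idx_b', 'idx_c'])

first = next(df.itertuples(name='Product'))
print(first._fields)            # field names of the namedtuple
print(first.price, first[1])    # attribute access vs. positional access
print(first._asdict())          # namedtuple converted to a dict

# Hypothetical frame whose first column name contains a space.
df_spaces = pd.DataFrame({'unit price': [100], 'weight': [20.3]})
renamed = next(df_spaces.itertuples())
print(renamed._fields)          # the invalid name becomes a positional name
```

For columns like 'unit price', positional access or renaming the columns beforehand is usually the more readable choice.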

As noted above, DataFrame.iterrows() returns (index, Series) pairs and therefore cannot preserve the dtypes of the original DataFrame. DataFrame.itertuples(), in contrast, keeps each value in the type of its own column.

As the example below shows, df['price'] has dtype int64 and row.price from df.itertuples() is still an integer (a Python int), not a float.

## DataFrame.itertuples() preserves dtypes, returning namedtuples of the values.
print('Data type of df price:', df['price'].dtype) # int
print('Data type of row price:', type(row.price)) # int

[Out] 
Data type of df price: int64
Data type of row price: <class 'int'>
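The difference between the two methods can be checked on the same row. This is a minimal sketch that takes the first row of the example DataFrame from both iterrows() and itertuples() and inspects the type of its 'price' value; series_row and tuple_row are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'price': [100, 200, 300],
     'weight': [20.3, 15.1, 25.9]},
    index=['idx_a', 'idx_b', 'idx_c'])

_, series_row = next(df.iterrows())   # row as a Series (one common dtype)
tuple_row = next(df.itertuples())     # row as a namedtuple (per-column types)

print(type(series_row['price']))      # a float type after upcasting
print(type(tuple_row.price))          # still an integer type
```

When the exact value types matter, for instance when writing rows to a database, itertuples() is therefore the safer of the two.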

[Reference]

* DataFrame.iterrows(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows

* DataFrame.iteritems(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iteritems.html

* DataFrame.itertuples(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.itertuples.html#pandas.DataFrame.itertuples

I hope this post was helpful.

Be a happy data scientist! :-)

Posted by Rfriend