A Friend for R and Python Analysis and Programming (by R Friend)


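[Python pandas] Combining a DataFrame and a Series: pd.concat(), append()

Python Analysis and Programming / Python Data Preprocessing 2016. 11. 30. 23:46

In the previous post, we used the pd.concat() function from the Python pandas library to combine DataFrames top + bottom and left + right.

This post continues from there and shows how to combine a DataFrame with a Series using pd.concat() and append().

Compared with combining DataFrames with each other, combining a DataFrame with a Series can be a little confusing when it comes to the index, but the simple examples below should make it easy to follow.

Let's start by importing pandas, DataFrame, and Series.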

# importing libraries

In [1]: import pandas as pd
   ...: from pandas import DataFrame
   ...: from pandas import Series

(1) Combining a DataFrame and a Series left + right: pd.concat([df, Series], axis=1)

Combining a DataFrame with a Series produces a DataFrame. With axis=1, the result grows to the right: the Series is added as a new column.

Pay attention to the column names of the combined DataFrame. The name of the Series becomes the name of the new column.

In [2]: df_1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
   ...:                      'B': ['B0', 'B1', 'B2'],
   ...:                      'C': ['C0', 'C1', 'C2'],
   ...:                      'D': ['D0', 'D1', 'D2']},
   ...:                     index=[0, 1, 2])

In [3]: df_1

Out[3]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2

In [4]: Series_1 = pd.Series(['S1', 'S2', 'S3'], name='S')

In [5]: Series_1

Out[5]:

0    S1
1    S2
2    S3
Name: S, dtype: object

# Concatenating a DataFrame and a Series along columns (from left to right)

# the new column in the resulting DataFrame takes its name from the Series

In [6]: pd.concat([df_1, Series_1], axis=1)

Out[6]:

A B C D S
0 A0 B0 C0 D0 S1
1 A1 B1 C1 D1 S2
2 A2 B2 C2 D2 S3
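
The example above works cleanly because the DataFrame and the Series share the same index. As a minimal sketch of what happens when they do not, the block below uses a Series whose index is shifted by one (the names `df`, `s`, and `aligned` are made up for this illustration). With the default join='outer', pd.concat() aligns rows by index label and fills the gaps with NaN.

```python
import pandas as pd

# DataFrame with index 0..2 and a Series with index 1..3 (illustrative data)
df = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=[0, 1, 2])
s = pd.Series(['S1', 'S2', 'S3'], index=[1, 2, 3], name='S')

# Rows are matched by index label, not by position
aligned = pd.concat([df, s], axis=1)
print(aligned)

# Labels present in only one input end up with NaN in the other column
print(aligned.isna())
```

If positional matching is what you want, resetting the index of both objects before concatenating is one option.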

(2) Ignoring column names and assigning integer labels automatically when combining a DataFrame and a Series left + right: ignore_index=True

In [7]: pd.concat([df_1, Series_1], axis=1, ignore_index=True)

Out[7]:

0 1 2 3 4
0 A0 B0 C0 D0 S1
1 A1 B1 C1 D1 S2
2 A2 B2 C2 D2 S3

(3) Combining Series with each other left + right: pd.concat([Series1, Series2, ...], axis=1)

If a Series has a name, it is used as the column name in the combined DataFrame. Series without a name are assigned the integers 0, 1, 2, ... automatically.

In [8]: Series_1 = pd.Series(['S1', 'S2', 'S3'], name='S')

In [9]: Series_2 = pd.Series([0, 1, 2]) # without name

In [10]: Series_3 = pd.Series([3, 4, 5]) # without name

In [11]: Series_1

Out[11]:

0    S1
1    S2
2    S3
Name: S, dtype: object

In [12]: Series_2

Out[12]:

0    0
1    1
2    2
dtype: int64

In [13]: Series_3

Out[13]:

0    3
1    4
2    5
dtype: int64

# name of Series will be used as the column name of concatenated DataFrame

In [14]: pd.concat([Series_1, Series_2, Series_3], axis=1)

Out[14]:

S 0 1
0 S1 0 3
1 S2 1 4
2 S3 2 5

(4) Overriding the column names when combining Series: keys = ['xx', 'xx', ...]

In [15]: pd.concat([Series_1, Series_2, Series_3], axis=1, keys=['C0', 'C1', 'C1'])

Out[15]:

   C0  C1  C1
0  S1   0   3
1  S2   1   4
2  S3   2   5
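
Note that the keys in In [15] repeat 'C1', so the result has two columns with the same name. The following sketch compares unique and duplicated keys (the variable names `unique_keys` and `dup_keys` are illustrative only). It suggests that unique keys are usually the safer choice, because selecting a duplicated column name returns a DataFrame rather than a single Series.

```python
import pandas as pd

s1 = pd.Series(['S1', 'S2', 'S3'], name='S')
s2 = pd.Series([0, 1, 2])
s3 = pd.Series([3, 4, 5])

# Unique keys: every column can be selected unambiguously
unique_keys = pd.concat([s1, s2, s3], axis=1, keys=['C0', 'C1', 'C2'])

# Duplicated keys: 'C1' refers to two columns at once
dup_keys = pd.concat([s1, s2, s3], axis=1, keys=['C0', 'C1', 'C1'])

print(type(unique_keys['C1']).__name__)  # a single column
print(type(dup_keys['C1']).__name__)     # both 'C1' columns together
```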

(5) Combining a DataFrame and a Series top + bottom: df.append(Series, ignore_index=True)

Be sure to set ignore_index=True.

In [16]: df_1

Out[16]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2

In [17]: Series_4 = pd.Series(['S1', 'S2', 'S3', 'S4'], index=['A', 'B', 'C', 'E'])

In [18]: Series_4

Out[18]:

A    S1
B    S2
C    S3
E    S4
dtype: object

In [19]: df_1.append(Series_4, ignore_index=True)

Out[19]:

    A   B   C    D    E
0  A0  B0  C0   D0  NaN
1  A1  B1  C1   D1  NaN
2  A2  B2  C2   D2  NaN
3  S1  S2  S3  NaN   S4

Without ignore_index=True, a 'TypeError' is raised, as shown below.

In [20]: df_1.append(Series_4) # TypeError without 'ignore_index=True'

Traceback (most recent call last):

File "<ipython-input-20-ca24d6ef8563>", line 1, in <module>

df_1.append(Series_4) # TypeError without 'ignore_index=True'

File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4314, in append

raise TypeError('Can only append a Series if ignore_index=True'

TypeError: Can only append a Series if ignore_index=True or if the Series has a name
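
DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the calls above only run on older versions. One possible replacement, sketched below, turns the Series into a one-row DataFrame with to_frame().T and then uses pd.concat(). The variable names `row` and `appended` are made up for this illustration, and other approaches (such as building the row as a dict) would work too.

```python
import pandas as pd

df_1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2'],
                     'C': ['C0', 'C1', 'C2'],
                     'D': ['D0', 'D1', 'D2']},
                    index=[0, 1, 2])
series_4 = pd.Series(['S1', 'S2', 'S3', 'S4'], index=['A', 'B', 'C', 'E'])

# Series -> one-column DataFrame -> transpose into a single row
# whose columns are the Series index labels (A, B, C, E)
row = series_4.to_frame().T

# Stack the row under df_1; columns missing on either side become NaN
appended = pd.concat([df_1, row], ignore_index=True)
print(appended)
```

The result should match the shape of Out[19]: four rows, with NaN in column D for the new row and NaN in column E for the original rows.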


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend

[Python pandas] Combining multiple DataFrames with the same structure: pd.concat()

Python Analysis and Programming / Python Data Preprocessing 2016. 11. 28. 23:47

During analysis you often need to gather and combine several data tables scattered in different places, at least when nobody else does the data preprocessing for you.

Where a database is managed in normalized form, the data will be split across multiple tables, with data entities separated by subject.

This post introduces pd.concat(), the pandas function for concatenating DataFrames, which is especially useful for combining datasets whose attributes have the same form (homogeneously-typed objects).

(It is similar to rbind() and cbind() in R.)

The default parameter settings of pd.concat() are shown below. Each is introduced with an example in turn.

pd.concat(objs,                    # Series, DataFrame, or Panel objects (Panel was later removed from pandas)
          axis=0,                  # 0: combine top + bottom, 1: combine left + right
          join='outer',            # 'outer': union, 'inner': intersection
          # join_axes=None,        # formerly reused a specific DataFrame's index when axis=1 (deprecated, no longer supported)
          ignore_index=False,      # False: keep the existing index, True: ignore the existing index
          keys=None,               # pass a sequence of keys to build a hierarchical index
          levels=None,
          names=None,              # pass a sequence of names to name the index levels
          verify_integrity=False,  # True: check the index for duplicates
          copy=True)               # copy the data

(1-1) Combining DataFrames top + bottom (rbind): axis = 0

# importing libraries

In [1]: import pandas as pd
   ...: from pandas import DataFrame

# making DataFrames

In [2]: df_1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
   ...:                      'B': ['B0', 'B1', 'B2'],
   ...:                      'C': ['C0', 'C1', 'C2'],
   ...:                      'D': ['D0', 'D1', 'D2']},
   ...:                     index=[0, 1, 2])

In [3]: df_2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
   ...:                      'B': ['B3', 'B4', 'B5'],
   ...:                      'C': ['C3', 'C4', 'C5'],
   ...:                      'D': ['D3', 'D4', 'D5']},
   ...:                     index=[3, 4, 5])

In [4]: df_1

Out[4]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2

In [5]: df_2

Out[5]:

A B C D
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5

# concatenating DataFrame1, 2 along rows, axis=0, default

In [6]: df_12_axis0 = pd.concat([df_1, df_2]) # row bind : axis = 0, default

In [7]: df_12_axis0

Out[7]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
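
Since pd.concat() takes a list, more than two DataFrames can be combined in a single call. The sketch below builds five small single-row DataFrames in a list comprehension and concatenates them once. The names `frames` and `combined` and the generated values are assumptions for illustration, not part of the original example.

```python
import pandas as pd

# Build several DataFrames with the same columns (illustrative data)
frames = [
    pd.DataFrame({'A': [f'A{i}'], 'B': [f'B{i}']}, index=[i])
    for i in range(5)
]

# Collect the pieces first, then concatenate once at the end,
# rather than concatenating repeatedly inside a loop
combined = pd.concat(frames)
print(combined)
```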

(1-2) Combining DataFrames left + right (cbind): axis = 1

In [8]: df_3 = pd.DataFrame({'E': ['A6', 'A7', 'A8'],
   ...:                      'F': ['B6', 'B7', 'B8'],
   ...:                      'G': ['C6', 'C7', 'C8'],
   ...:                      'H': ['D6', 'D7', 'D8']},
   ...:                     index=[0, 1, 2])

In [9]: df_1

Out[9]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2

In [10]: df_3

Out[10]:

E F G H
0 A6 B6 C6 D6
1 A7 B7 C7 D7
2 A8 B8 C8 D8

# concatenating DataFrames along columns, axis=1

In [11]: df_13_axis1 = pd.concat([df_1, df_3], axis=1) # column bind

In [12]: df_13_axis1

Out[12]:

A B C D E F G H
0 A0 B0 C0 D0 A6 B6 C6 D6
1 A1 B1 C1 D1 A7 B7 C7 D7
2 A2 B2 C2 D2 A8 B8 C8 D8

(2-1) Combining DataFrames as a union: join = 'outer'

In [13]: df_4 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
    ...:                      'B': ['B0', 'B1', 'B2'],
    ...:                      'C': ['C0', 'C1', 'C2'],
    ...:                      'E': ['E0', 'E1', 'E2']},
    ...:                     index=[0, 1, 3])

In [17]: df_1

Out[17]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2

In [18]: df_4

Out[18]:

A B C E
0 A0 B0 C0 E0
1 A1 B1 C1 E1
3 A2 B2 C2 E2

In [19]: df_14_outer = pd.concat([df_1, df_4], join='outer') # union, default

In [20]: df_14_outer

Out[20]:

    A   B   C    D    E
0  A0  B0  C0   D0  NaN
1  A1  B1  C1   D1  NaN
2  A2  B2  C2   D2  NaN
0  A0  B0  C0  NaN   E0
1  A1  B1  C1  NaN   E1
3  A2  B2  C2  NaN   E2

(2-2) Combining DataFrames as an intersection: join = 'inner'

In [21]: df_14_inner = pd.concat([df_1, df_4], join='inner') # intersection

In [22]: df_14_inner

Out[22]:

A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
0 A0 B0 C0
1 A1 B1 C1
3 A2 B2 C2
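
With axis=0, the join parameter decides which columns survive, while the rows of both inputs are always kept. The following sketch checks this by comparing the column lists of the two results (the names `outer` and `inner` are illustrative).

```python
import pandas as pd

df_1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2'],
                     'C': ['C0', 'C1', 'C2'],
                     'D': ['D0', 'D1', 'D2']},
                    index=[0, 1, 2])
df_4 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2'],
                     'C': ['C0', 'C1', 'C2'],
                     'E': ['E0', 'E1', 'E2']},
                    index=[0, 1, 3])

outer = pd.concat([df_1, df_4], join='outer')  # union of columns
inner = pd.concat([df_1, df_4], join='inner')  # columns common to both

print(list(outer.columns))
print(list(inner.columns))
print(len(outer), len(inner))  # the row count is the same in both cases
```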

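(3) Reusing the index of a specific DataFrame when axis=1: join_axes (removed; use reindex() instead)

Below, for axis=1 (left + right), three approaches are compared: join='outer', join='inner', and reusing df_1's index with reindex() (the replacement for the former join_axes=[df.index]). Compare the index of each combined DataFrame carefully.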

In [23]: df_1

Out[23]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2

In [24]: df_4

Out[24]:

A B C E
0 A0 B0 C0 E0
1 A1 B1 C1 E1
3 A2 B2 C2 E2

# comparison 1
In [25]: df_14_outer_axis1 = pd.concat([df_1, df_4], join='outer', axis=1) # default

In [26]: df_14_outer_axis1

Out[26]:

     A    B    C    D    A    B    C    E
0   A0   B0   C0   D0   A0   B0   C0   E0
1   A1   B1   C1   D1   A1   B1   C1   E1
2   A2   B2   C2   D2  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN   A2   B2   C2   E2

# comparison 2

In [29]: df_14_inner_axis1 = pd.concat([df_1, df_4], join='inner', axis=1)

In [30]: df_14_inner_axis1

Out[30]:

A B C D A B C E
0 A0 B0 C0 D0 A0 B0 C0 E0
1 A1 B1 C1 D1 A1 B1 C1 E1

# reuse the exact index from the original DataFrame : reindex()

In [31]: df_14_axis1_reindex = pd.concat([df_1, df_4], axis=1).reindex(df_1.index)

In [32]: df_14_axis1_reindex

Out[32]:

    A   B   C   D    A    B    C    E
0  A0  B0  C0  D0   A0   B0   C0   E0
1  A1  B1  C1  D1   A1   B1   C1   E1
2  A2  B2  C2  D2  NaN  NaN  NaN  NaN

* (Note) If you pass the join_axes parameter with a recent version of pandas, you will get the error TypeError: concat() got an unexpected keyword argument 'join_axes'. This post was written in 2016, and some later changes to the pandas parameters had not yet been reflected here. (Thanks to 김명찬 for the comment that made this correction possible.)

The join_axes parameter is no longer supported. Instead, use reindex() to reuse an existing index, as in In [31] and In [32] above.

df_14_join_axes_axis1 = pd.concat([df_1, df_4], join_axes=[df_1.index], axis=1)

------------------------------------------------------------

TypeError Traceback (most recent call last)

<ipython-input-20-748c1e0a3504> in <module>

----> 1 df_14_join_axes_axis1 = pd.concat([df_1, df_4], join_axes=[df_1.index], axis=1)

TypeError: concat() got an unexpected keyword argument 'join_axes'
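
Besides calling reindex() on the concatenated result, another way to get the same effect is to reindex the second DataFrame to df_1's index before concatenating. This is only a sketch of that alternative; the variable name `result` is made up here.

```python
import pandas as pd

df_1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2'],
                     'C': ['C0', 'C1', 'C2'],
                     'D': ['D0', 'D1', 'D2']},
                    index=[0, 1, 2])
df_4 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2'],
                     'C': ['C0', 'C1', 'C2'],
                     'E': ['E0', 'E1', 'E2']},
                    index=[0, 1, 3])

# Align df_4 to df_1's index first: label 3 is dropped, label 2 becomes NaN
df_4_aligned = df_4.reindex(df_1.index)

# Both inputs now share the index [0, 1, 2], so no extra rows appear
result = pd.concat([df_1, df_4_aligned], axis=1)
print(result)
```

Reindexing first avoids building the larger outer-joined intermediate result, which may matter for big inputs.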

(4) Ignoring the existing index: ignore_index

In [33]: df_5 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
    ...:                      'B': ['B0', 'B1', 'B2'],
    ...:                      'C': ['C0', 'C1', 'C2'],
    ...:                      'D': ['D0', 'D1', 'D2']},
    ...:                     index=['r0', 'r1', 'r2'])

In [34]: df_6 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
    ...:                      'B': ['B3', 'B4', 'B5'],
    ...:                      'C': ['C3', 'C4', 'C5'],
    ...:                      'D': ['D3', 'D4', 'D5']},
    ...:                     index=['r3', 'r4', 'r5'])

In [35]: df_56_with_index = pd.concat([df_5, df_6], ignore_index=False) # default

In [36]: df_56_with_index

Out[36]:

A B C D
r0 A0 B0 C0 D0
r1 A1 B1 C1 D1
r2 A2 B2 C2 D2
r3 A3 B3 C3 D3
r4 A4 B4 C4 D4
r5 A5 B5 C5 D5

# to ignore the existing index, use 'ignore_index=True'

In [37]: df_56_ignore_index = pd.concat([df_5, df_6], ignore_index=True)  # index 0 ~ (n-1)

In [38]: df_56_ignore_index

Out[38]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
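
For a top + bottom concatenation, ignore_index=True should give the same result as concatenating first and then calling reset_index(drop=True). The sketch below checks that equivalence on the same data (the names `via_flag` and `via_reset` are illustrative).

```python
import pandas as pd

df_5 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['r0', 'r1', 'r2'])
df_6 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                     'B': ['B3', 'B4', 'B5']},
                    index=['r3', 'r4', 'r5'])

# Option 1: drop the old index while concatenating
via_flag = pd.concat([df_5, df_6], ignore_index=True)

# Option 2: concatenate, then replace the index with 0..n-1
via_reset = pd.concat([df_5, df_6]).reset_index(drop=True)

print(via_flag.equals(via_reset))
```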

(5) Building a hierarchical index: keys

# concatenating DataFrames : Construct hierarchical index using 'keys'

In [40]: df_56_with_keys = pd.concat([df_5, df_6], keys=['df_5', 'df_6'])

In [41]: df_56_with_keys

Out[41]:

          A   B   C   D
df_5 r0  A0  B0  C0  D0
     r1  A1  B1  C1  D1
     r2  A2  B2  C2  D2
df_6 r3  A3  B3  C3  D3
     r4  A4  B4  C4  D4
     r5  A5  B5  C5  D5

For reference, here is how to index with a hierarchical index. The index of the 'df_56_with_keys' DataFrame has two levels, so you can index with the first-level label and the second-level label separately. See the examples below.

In [42]: df_56_with_keys.loc['df_5']

Out[42]:

A B C D
r0 A0 B0 C0 D0
r1 A1 B1 C1 D1
r2 A2 B2 C2 D2

In [43]: df_56_with_keys.loc['df_5'][0:2]

Out[43]:

A B C D
r0 A0 B0 C0 D0
r1 A1 B1 C1 D1
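
Both levels can also be addressed in one step. As a minimal sketch, the block below selects a row with a (first level, second level) tuple in .loc and looks up a second-level label with xs(). The names `keyed`, `one_row`, `cell`, and `by_inner` are made up for this illustration.

```python
import pandas as pd

df_5 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'C': ['C0', 'C1', 'C2']},
                    index=['r0', 'r1', 'r2'])
df_6 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                     'C': ['C3', 'C4', 'C5']},
                    index=['r3', 'r4', 'r5'])

keyed = pd.concat([df_5, df_6], keys=['df_5', 'df_6'])

# A tuple selects one row across both index levels (returns a Series)
one_row = keyed.loc[('df_5', 'r1')]

# A tuple plus a column label selects a single cell
cell = keyed.loc[('df_6', 'r4'), 'C']

# xs() looks up a label on the second level without naming the first
by_inner = keyed.xs('r4', level=1)

print(one_row)
print(cell)
print(by_inner)
```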

(6) Naming the index levels: names

In [44]: df_56_with_name = pd.concat([df_5, df_6],
    ...:                             keys=['df_5', 'df_6'],
    ...:                             names=['df_name', 'row_number'])

In [45]: df_56_with_name

Out[45]:

                     A   B   C   D
df_name row_number
df_5    r0          A0  B0  C0  D0
        r1          A1  B1  C1  D1
        r2          A2  B2  C2  D2
df_6    r3          A3  B3  C3  D3
        r4          A4  B4  C4  D4
        r5          A5  B5  C5  D5
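
Once the levels are named, later operations can refer to them by name instead of by position. The sketch below shows two such uses: counting rows per source DataFrame with groupby(level=...), and turning the index levels into ordinary columns with reset_index(). The names `named`, `counts`, and `flat` are illustrative.

```python
import pandas as pd

df_5 = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['r0', 'r1', 'r2'])
df_6 = pd.DataFrame({'A': ['A3', 'A4', 'A5']}, index=['r3', 'r4', 'r5'])

named = pd.concat([df_5, df_6],
                  keys=['df_5', 'df_6'],
                  names=['df_name', 'row_number'])

# Count how many rows came from each source DataFrame
counts = named.groupby(level='df_name').size()

# Move both index levels into regular columns named after the levels
flat = named.reset_index()

print(counts)
print(flat)
```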

(7) Checking for duplicate index values: verify_integrity

Let's build df_7 and df_8 so that both contain the index label 'r2', then apply pd.concat(). With verify_integrity=False (the default, so it need not be specified), the DataFrames are stacked top + bottom without any error message, and the 'r2' index label appears twice. With verify_integrity=True, by contrast, any duplicated index label raises 'ValueError: Indexes have overlapping values: xxx' and the concatenation is not performed at all.

In [48]: df_7 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
    ...:                      'B': ['B0', 'B1', 'B2'],
    ...:                      'C': ['C0', 'C1', 'C2'],
    ...:                      'D': ['D0', 'D1', 'D2']},
    ...:                     index=['r0', 'r1', 'r2'])

In [49]: df_8 = pd.DataFrame({'A': ['A2', 'A3', 'A4'],
    ...:                      'B': ['B2', 'B3', 'B4'],
    ...:                      'C': ['C2', 'C3', 'C4'],
    ...:                      'D': ['D2', 'D3', 'D4']},
    ...:                     index=['r2', 'r3', 'r4'])

In [50]: df_7

Out[50]:

A B C D
r0 A0 B0 C0 D0
r1 A1 B1 C1 D1
r2 A2 B2 C2 D2

In [51]: df_8

Out[51]:

A B C D
r2 A2 B2 C2 D2
r3 A3 B3 C3 D3
r4 A4 B4 C4 D4

# concatenating DataFrames without overlap checking : verify_integrity=False

In [52]: df_78_F_verify_integrity = pd.concat([df_7, df_8],
    ...:                                      verify_integrity=False) # default

In [53]: df_78_F_verify_integrity

Out[53]:

A B C D
r0 A0 B0 C0 D0
r1 A1 B1 C1 D1
r2 A2 B2 C2 D2
r2 A2 B2 C2 D2
r3 A3 B3 C3 D3
r4 A4 B4 C4 D4

# index overlap checking, using verify_integrity=True

In [54]: df_78_T_verify_integrity = pd.concat([df_7, df_8],
    ...:                                      verify_integrity=True)

Traceback (most recent call last):

File "<ipython-input-56-5512ad3b5016>", line 2, in <module>
verify_integrity=True)

File "C:\Anaconda3\lib\site-packages\pandas\tools\merge.py", line 845, in concat
copy=copy)

File "C:\Anaconda3\lib\site-packages\pandas\tools\merge.py", line 984, in __init__
self.new_axes = self._get_new_axes()

File "C:\Anaconda3\lib\site-packages\pandas\tools\merge.py", line 1073, in _get_new_axes
new_axes[self.axis] = self._get_concat_axis()

File "C:\Anaconda3\lib\site-packages\pandas\tools\merge.py", line 1132, in _get_concat_axis
self._maybe_check_integrity(concat_axis)

File "C:\Anaconda3\lib\site-packages\pandas\tools\merge.py", line 1141, in _maybe_check_integrity
% str(overlap))

ValueError: Indexes have overlapping values: ['r2']
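
In a script, the ValueError can be caught so that the overlap is handled rather than stopping the program. The sketch below is one way to do that, and it also lists the duplicated labels with Index.duplicated(). The names `overlap_found`, `stacked`, and `dups` are assumptions for this illustration.

```python
import pandas as pd

df_7 = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['r0', 'r1', 'r2'])
df_8 = pd.DataFrame({'A': ['A2', 'A3', 'A4']}, index=['r2', 'r3', 'r4'])

# verify_integrity=True raises ValueError when index labels overlap
try:
    pd.concat([df_7, df_8], verify_integrity=True)
    overlap_found = False
except ValueError as err:
    overlap_found = True
    print('concat refused:', err)

# Without the check, the duplicated labels can be listed afterwards
stacked = pd.concat([df_7, df_8])
dups = stacked.index[stacked.index.duplicated()].tolist()
print(dups)
```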

I hope this was helpful.


Posted by Rfriend
