[Python pandas] Filling and imputing missing values : df.fillna()

Python Analysis and Programming / Python Data Preprocessing, 2016. 12. 9. 23:28

The previous post covered how to check for missing values and how to count them.

This post covers several ways to fill in or replace missing values:

(1) Replace missing values with a scalar value

(2) Fill gaps forward or backward

(3) Limit the amount of filling

(4) Fill missing values with the mean of each column

(5) Fill missing values with the values of another column

Checking for and handling missing values is an essential step before any modeling.

First, import pandas and numpy and build a DataFrame that contains missing values. To create a missing value, assign None or np.nan.

import numpy as np
import pandas as pd

## making a sample DataFrame
df = pd.DataFrame(np.random.randn(5, 3),
                  columns=['C1', 'C2', 'C3'])
                  
df

#          C1        C2        C3
# 0 -1.245527 -0.332396 -1.310678
# 1 -2.007546 -0.747846 -0.704082
# 2 -2.237326 -1.065539  1.086993
# 3 -1.179056  1.372350  0.388981
# 4  1.070266 -1.545383  0.996416


## replacing some values with np.nan
df.loc[0, 'C1'] = np.nan
df.loc[1, 'C1'] = np.nan
df.loc[1, 'C3'] = np.nan
df.loc[2, 'C2'] = np.nan
df.loc[3, 'C2'] = np.nan
df.loc[4, 'C3'] = np.nan


df

#          C1        C2        C3
# 0       NaN  0.893384  2.388055
# 1       NaN  0.012155       NaN
# 2  0.099263       NaN -1.330312
# 3  2.468724       NaN  1.046972
# 4 -1.157316  1.465829       NaN

(1) Replace missing values with a scalar value : df.fillna(0)

First, replace the missing values with 0.

## (1) replace missing values with a scalar value
## (1-1) fill missing values with the number 0
df_0 = df.fillna(0)

print(df_0)

#          C1        C2        C3
# 0  0.000000  0.893384  2.388055
# 1  0.000000  0.012155  0.000000
# 2  0.099263  0.000000 -1.330312
# 3  2.468724  0.000000  1.046972
# 4 -1.157316  1.465829  0.000000

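Next, fill the missing values with the string 'missing'. Note that the filled columns then hold a mix of floats and strings, so their dtype becomes object.

## (1-2) fill missing values with the string 'missing'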
df_missing = df.fillna('missing')

print(df_missing)

#           C1         C2       C3
# 0    missing   0.893384  2.38805
# 1    missing  0.0121551  missing
# 2  0.0992629    missing -1.33031
# 3    2.46872    missing  1.04697
# 4   -1.15732    1.46583  missing

(2) Fill gaps forward or backward
: fillna(method='ffill' or 'pad'), fillna(method='bfill' or 'backfill')

(2-1) To fill gaps forward, that is, to carry the last valid value down into the missing rows that follow it, use fillna(method='ffill') or fillna(method='pad'). Missing values at the very top of a column have nothing above them and stay NaN.

## (2) fill gaps forward or backward

print(df)

#          C1        C2        C3
# 0       NaN  0.893384  2.388055
# 1       NaN  0.012155       NaN
# 2  0.099263       NaN -1.330312
# 3  2.468724       NaN  1.046972
# 4 -1.157316  1.465829       NaN


## (2-1) fill missing values from top to bottom (forward filling)
df.fillna(method='ffill') 
#df.fillna(method='pad') # or equivalently

#          C1        C2        C3
# 0       NaN  0.893384  2.388055
# 1       NaN  0.012155  2.388055
# 2  0.099263  0.012155 -1.330312
# 3  2.468724  0.012155  1.046972
# 4 -1.157316  1.465829  1.046972

(2-2) To fill gaps backward, use fillna(method='bfill') or fillna(method='backfill'). Missing values at the very bottom of a column stay NaN.

## (2-2) fill missing values from bottom to top (backward filling)

print(df)

#          C1        C2        C3
# 0       NaN  0.893384  2.388055
# 1       NaN  0.012155       NaN
# 2  0.099263       NaN -1.330312
# 3  2.468724       NaN  1.046972
# 4 -1.157316  1.465829       NaN


df.fillna(method='bfill')
#df.fillna(method='backfill') # or equivalently

#          C1        C2        C3
# 0  0.099263  0.893384  2.388055
# 1  0.099263  0.012155 -1.330312
# 2  0.099263  1.465829 -1.330312
# 3  2.468724  1.465829  1.046972
# 4 -1.157316  1.465829       NaN
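Newer pandas releases deprecate the method= keyword of fillna() in favor of the dedicated ffill() and bfill() methods. Here is a minimal sketch of the same forward and backward fills; the frame df_demo and its values are made up for illustration and only mimic the NaN pattern of the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same NaN pattern as the example above.
df_demo = pd.DataFrame({
    'C1': [np.nan, np.nan, 0.1, 2.5, -1.2],
    'C2': [0.9, 0.0, np.nan, np.nan, 1.5],
    'C3': [2.4, np.nan, -1.3, 1.0, np.nan],
})

forward = df_demo.ffill()   # carries the last valid value downward
backward = df_demo.bfill()  # carries the next valid value upward

print(forward)
print(backward)
```

The leading NaNs of 'C1' survive the forward fill and the trailing NaN of 'C3' survives the backward fill, just as in the outputs above.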

(3) Limit the amount of forward/backward filling
: fillna(method='ffill', limit=number), fillna(method='bfill', limit=number)

When filling forward or backward, limit=1 restricts the fill to at most one consecutive missing value per gap; anything beyond that stays NaN.

This is one of the features that comes in handy in time series analysis.

## (3) limit the amount of forward/backward filling

df.fillna(method='ffill', limit=1) # fill values forward with limit

#          C1        C2        C3
# 0       NaN  0.893384  2.388055
# 1       NaN  0.012155  2.388055
# 2  0.099263  0.012155 -1.330312
# 3  2.468724       NaN  1.046972
# 4 -1.157316  1.465829  1.046972


df.fillna(method='bfill', limit=1) # fill values backward with limit

#          C1        C2        C3
# 0       NaN  0.893384  2.388055
# 1  0.099263  0.012155 -1.330312
# 2  0.099263       NaN -1.330312
# 3  2.468724  1.465829  1.046972
# 4 -1.157316  1.465829       NaN
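To see exactly what limit does, it helps to look at a single gap that is longer than the limit. The sketch below uses a made-up Series s with three consecutive NaNs and the ffill()/bfill() methods, which accept the same limit argument.

```python
import numpy as np
import pandas as pd

# One gap of three consecutive missing values (illustrative data).
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

filled_fwd = s.ffill(limit=1)  # only the first NaN after 1.0 is filled
filled_bwd = s.bfill(limit=1)  # only the last NaN before 5.0 is filled

print(filled_fwd)
print(filled_bwd)
```

With limit=1 each direction fills just one position of the gap, so the middle of a long gap is left missing instead of being padded with a stale value.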

(4) Fill missing values with the mean of each column
: df.fillna(df.mean()), df.where(pd.notnull(df), df.mean(), axis='columns')

## (4) fill missing values with the mean of each column

## mean values per columns
df.mean()

# C1    0.470224
# C2    0.790456
# C3    0.701572
# dtype: float64


## filling missing values with mean per columns
df.fillna(df.mean())
#df.where(pd.notnull(df), df.mean(), axis='columns') # or equivalently

#          C1        C2        C3
# 0  0.470224  0.893384  2.388055
# 1  0.470224  0.012155  0.701572
# 2  0.099263  0.790456 -1.330312
# 3  2.468724  0.790456  1.046972
# 4 -1.157316  1.465829  0.701572

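The example above replaces the missing values in each column with that column's own mean.

The example below instead uses the mean of column 'C1' alone to fill the missing values in all of 'C1', 'C2', and 'C3'.

## fill the missing values of 'C1', 'C2', 'C3' with the mean of 'C1'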
df.mean()['C1']

# 0.4702236218547502


df.fillna(df.mean()['C1'])

#          C1        C2        C3
# 0  0.470224  0.893384  2.388055
# 1  0.470224  0.012155  0.470224
# 2  0.099263  0.470224 -1.330312
# 3  2.468724  0.470224  1.046972
# 4 -1.157316  1.465829  0.470224

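The example below fills missing values only in columns 'C1' and 'C2', each with its own column mean. It is a mix of the two examples above: fillna() receives a Series that holds means for 'C1' and 'C2' only, so any column absent from that Series is left untouched.

(The NaN values in 'C3' remain missing.)

## fill the missing values of 'C1' and 'C2' only, each with its own column mean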

## mean values of 'C1', 'C2'
df.mean()['C1':'C2']

# C1    0.470224
# C2    0.790456
# dtype: float64


## filling the missing value of 'C1', 'C2' with each mean values
df.fillna(df.mean()['C1':'C2'])

#          C1        C2        C3
# 0  0.470224  0.893384  2.388055
# 1  0.470224  0.012155       NaN
# 2  0.099263  0.790456 -1.330312
# 3  2.468724  0.790456  1.046972
# 4 -1.157316  1.465829       NaN
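fillna() also accepts a plain dict that maps column names to fill values, which makes it easy to pick a different statistic per column. A minimal sketch, with a made-up frame df_demo and an arbitrary choice of mean for 'C1' and median for 'C2':

```python
import numpy as np
import pandas as pd

# Illustrative data; not the random frame used above.
df_demo = pd.DataFrame({
    'C1': [np.nan, 2.0, 4.0],
    'C2': [1.0, np.nan, 3.0],
    'C3': [np.nan, 5.0, 7.0],
})

# One fill value per listed column; 'C3' is not listed, so its NaN is kept.
fill_values = {'C1': df_demo['C1'].mean(), 'C2': df_demo['C2'].median()}
filled = df_demo.fillna(fill_values)

print(filled)
```

As with the sliced Series of means above, columns that do not appear in the dict are simply left alone.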

(5) Fill missing values with the values of another column

There are two ways. The first uses np.where() together with pd.notnull(): np.where(pd.notnull(df['C2']), df['C2'], df['C1']) keeps the value of 'C2' where it is not missing and takes the value of 'C1' where 'C2' is missing.

## (5) filling missing values with another column's values

df_2 = pd.DataFrame({
    'C1': [1, 2, 3, 4, 5], 
    'C2': [6, 7, 8, 9, 10]})

## put NaNs as an example
df_2.loc[[1, 3], ['C2']] = np.nan

df_2

#    C1    C2
# 0   1   6.0
# 1   2   NaN
# 2   3   8.0
# 3   4   NaN
# 4   5  10.0


## making new column by filling missing values with another column's value
# way 1 : by np.where => quick
df_2['C2_New'] = np.where(pd.notnull(df_2['C2']),
                          df_2['C2'], df_2['C1'])

df_2

#    C1    C2  C2_New
# 0   1   6.0     6.0
# 1   2   NaN     2.0
# 2   3   8.0     8.0
# 3   4   NaN     4.0
# 4   5  10.0    10.0

The loop below returns the same result but takes far longer on real data, so treat it only as a demonstration that it can also be done this way.

(np.where() with pd.notnull() is faster, shorter, and simpler, so it is the recommended approach.)

## way 2 : by loop programming => takes a long time
for i in df_2.index:
    if pd.notnull(df_2.loc[i, 'C2']):
        df_2.loc[i, 'C2_New_2'] = df_2.loc[i, 'C2']
    else:
        df_2.loc[i, 'C2_New_2'] = df_2.loc[i, 'C1']
        
        
 
df_2

#    C1    C2  C2_New  C2_New_2
# 0   1   6.0     6.0       6.0
# 1   2   NaN     2.0       2.0
# 2   3   8.0     8.0       8.0
# 3   4   NaN     4.0       4.0
# 4   5  10.0    10.0      10.0
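As a side note, the same column-to-column fill can also be written without np.where(): Series.fillna() accepts another Series and aligns it on the index, and Series.combine_first() does the same job. A short sketch that rebuilds df_2 from the example above (the new column names C2_New_3 and C2_New_4 are made up here):

```python
import numpy as np
import pandas as pd

df_2 = pd.DataFrame({'C1': [1, 2, 3, 4, 5],
                     'C2': [6, np.nan, 8, np.nan, 10]})

# Take 'C2' where it is present, otherwise fall back to 'C1' (aligned on index).
df_2['C2_New_3'] = df_2['C2'].fillna(df_2['C1'])
df_2['C2_New_4'] = df_2['C2'].combine_first(df_2['C1'])

print(df_2)
```

Both columns hold 6, 2, 8, 4, 10, the same values as C2_New above.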

For filling missing values with group-wise means, see http://rfriend.tistory.com/402.

For applying a different imputation strategy to each column of a DataFrame, see https://rfriend.tistory.com/542.

For replacing missing values with predictions from a linear regression model, see rfriend.tistory.com/636.

I hope this was helpful.

License: Attribution, NonCommercial, NoDerivatives

Posted by Rfriend

[Python pandas] Calculations with missing data

Python Analysis and Programming / Python Data Preprocessing, 2016. 12. 8. 22:34

The previous post introduced how to check the following in a Python pandas DataFrame:

- whether values are missing : isnull(), notnull()

- the number of missing values per column : df.isnull().sum()

- the number of missing values per row : df.isnull().sum(1)

(Link: http://rfriend.tistory.com/260)

This post looks at calculations with missing data.

First, import the required modules and build a DataFrame that contains missing values.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from pandas import DataFrame

In [4]: df = DataFrame(np.arange(10).reshape(5,2),
   ...: index=['a', 'b', 'c', 'd', 'e'],
   ...: columns=['C1', 'C2'])
   ...:

In [5]: df
Out[5]:
   C1 C2
a   0   1
b   2   3
c   4   5
d   6   7
e   8   9

In [6]: df.loc[['b', 'e'], ['C1']] = None

In [7]: df.loc[['b', 'c'], ['C2']] = None

In [8]: df
Out[8]:
    C1   C2
a 0.0 1.0
b NaN NaN
c 4.0 NaN
d 6.0 7.0
e NaN 9.0

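(1) sum(), cumsum() : NaN is skipped, which for a sum is the same as treating it as 0 (cumsum() keeps NaN at the missing positions but carries the running total past them)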

# When summing data, NA (missing) values will be treated as zero
In [9]: df.sum() # sum by columns
Out[9]:
C1    10.0
C2    17.0
dtype: float64

In [10]: df['C1'].sum() # sum of 'C1' column
Out[10]: 10.0

In [11]: df['C1'].cumsum() # cumulative sum of 'C1' column
Out[11]:
a     0.0
b     NaN
c     4.0
d    10.0
e     NaN
Name: C1, dtype: float64
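Skipping NaN is only the default. The sketch below shows the skipna and min_count arguments of sum() on a made-up Series s that mirrors column 'C1'; the variable names are chosen just for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, 4.0, 6.0, np.nan], index=list('abcde'))

total_skip = s.sum()                  # NaN skipped (default)
total_strict = s.sum(skipna=False)    # any NaN makes the result NaN
running = s.cumsum()                  # NaN stays NaN, total continues after it

all_nan = pd.Series([np.nan, np.nan])
all_nan_sum = all_nan.sum()               # sum of "nothing" is 0.0
all_nan_min = all_nan.sum(min_count=1)    # require at least one valid value

print(total_skip, total_strict, all_nan_sum, all_nan_min)
print(running)
```

min_count=1 is worth knowing because a column that is entirely missing otherwise sums to 0.0, which can look like real data.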

(2) mean(), std() : NaN is excluded from the calculation (it does not count toward the number of observations)

In [8]: df
Out[8]:
    C1   C2
a 0.0 1.0
b NaN NaN
c 4.0 NaN
d 6.0 7.0
e NaN 9.0

# As for mean(), std() methods, NA (missing) values will be ignored
In [12]: df.mean() # mean by columns
Out[12]:
C1    3.333333 # (0+4+6)/3 = 3.333333
C2    5.666667 # (1+7+9)/3 = 5.666667
dtype: float64

In [13]: df.mean(1) # mean by row
Out[13]:
a    0.5
b    NaN
c    4.0
d    6.5
e    9.0
dtype: float64

In [14]: df.std()
Out[14]:
C1    3.055050
C2    4.163332
dtype: float64
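To double-check that NaN is really dropped from the denominator, the mean and standard deviation can be recomputed by hand from the non-missing values. A small sketch on a made-up Series s that mirrors column 'C1' (pandas uses the sample standard deviation, ddof=1, by default):

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, 4.0, 6.0, np.nan])

n_used = s.count()                       # number of non-missing values
mean_skip = s.mean()
manual_mean = s.dropna().sum() / n_used  # (0 + 4 + 6) / 3

std_skip = s.std()
manual_std = np.std(s.dropna().to_numpy(), ddof=1)

mean_strict = s.mean(skipna=False)       # NaN once any value is missing

print(n_used, mean_skip, manual_mean, std_skip, manual_std, mean_strict)
```

The manual values agree with mean() and std(), and they match the 3.333333 and 3.055050 shown for 'C1' above.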

(3) Arithmetic between DataFrame columns : the result is NaN if any operand is NaN

Look closely at index 'c' and 'e' in the 'C3' column below. If either 'C1' or 'C2' is NaN, their sum 'C3' is NaN. (If both are NaN, as in row 'b', the sum is of course NaN as well.)

In [8]: df
Out[8]:
    C1   C2
a 0.0 1.0
b NaN NaN
c 4.0 NaN
d 6.0 7.0
e NaN 9.0

In [15]: df['C3'] = df['C1'] + df['C2']

In [16]: df
Out[16]:
    C1   C2    C3
a 0.0 1.0   1.0
b NaN NaN   NaN
c 4.0 NaN   NaN
d 6.0 7.0 13.0
e NaN 9.0   NaN

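(4) Arithmetic between DataFrames : columns present in both are added element-wise (NaN still propagates),
columns present in only one DataFrame come back as all NaN

In the example below, 'C1' is the only column that df and df_2 have in common; the other columns exist in only one of the two DataFrames.

Adding df and df_2 with '+' produces real sums only in 'C1', and even there the rows where df holds NaN ('b', 'e') stay NaN: the '+' operator does not treat NaN as 0. Every other column comes back entirely NaN.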

In [17]: df_2 = DataFrame({'C1' : [1, 1, 1, 1, 1],
    ...: 'C4' : [1, 1, 1, 1, 1]},
    ...: index=['a', 'b', 'c', 'd', 'e'])
    ...:

In [18]: df
Out[18]:
    C1   C2    C3
a 0.0 1.0   1.0
b NaN NaN   NaN
c 4.0 NaN   NaN
d 6.0 7.0 13.0
e NaN 9.0   NaN

In [19]: df_2
Out[19]:
   C1 C4
a   1   1
b   1   1
c   1   1
d   1   1
e   1   1

In [20]: df + df_2
Out[20]:
    C1 C2 C3 C4
a 1.0 NaN NaN NaN
b NaN NaN NaN NaN
c 5.0 NaN NaN NaN
d 7.0 NaN NaN NaN
e NaN NaN NaN NaN
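If you do want missing entries to count as 0 when adding two DataFrames, the add() method takes a fill_value argument. A minimal sketch on a cut-down, three-row version of df and df_2 (the shortened frames are made up for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'C1': [0.0, np.nan, 4.0],
                   'C2': [1.0, np.nan, np.nan]},
                  index=['a', 'b', 'c'])
df_2 = pd.DataFrame({'C1': [1, 1, 1],
                     'C4': [1, 1, 1]},
                    index=['a', 'b', 'c'])

plain = df + df_2                     # NaN propagates, unmatched columns are NaN
filled = df.add(df_2, fill_value=0)   # a value missing on one side counts as 0

print(plain)
print(filled)
```

A cell is still NaN in the fill_value result when it is missing on both sides, for example 'C2' in row 'b'.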

That concludes calculations with missing data.

The next post covers filling and imputing missing values.

[Python pandas] Checking for and counting missing values in a DataFrame : isnull(), notnull(), df.isnull().sum(), df.notnull().sum(), df.isnull().sum(1), df.notnull().sum(1)

Python Analysis and Programming / Python Data Preprocessing, 2016. 12. 7. 23:46

When you analyze a DataFrame, missing values are a nuisance that follows you around most of the time.

The data may never have been collected or measured in the first place, missing values may appear when several DataFrames are merged, and reindexing can introduce them as well.

Whatever their origin, missing values can cause errors in an analysis or distort its results. So once you have built the DataFrame you intend to analyze, always check whether it contains missing values and deal with them.

This post shows how to check for missing values with the isnull() and notnull() methods of Python pandas.

pandas displays a missing value as 'NaN', and it also recognizes 'None' as missing.

First, build a DataFrame that contains missing values.

# making DataFrame with missing values
In [1]: import pandas as pd

In [2]: from pandas import DataFrame

In [3]: df_left = DataFrame({'KEY': ['K0', 'K1', 'K2', 'K3'],
   ...: 'A': ['A0', 'A1', 'A2', 'A3'],
   ...: 'B': [0.5, 2.2, 3.6, 0.4]})

In [4]: df_right = DataFrame({'KEY': ['K2', 'K3', 'K4', 'K5'],
   ...: 'C': ['C2', 'C3', 'C4', 'C5'],
   ...: 'D': ['D2', 'D3', 'D4', 'D5']})

In [5]: df_all = pd.merge(df_left, df_right, how='outer', on='KEY')

In [6]: df_all
Out[6]:
     A    B KEY    C    D
0   A0 0.5 K0 NaN NaN
1   A1 2.2 K1 NaN NaN
2   A2 3.6 K2   C2   D2
3   A3 0.4 K3   C3   D3
4 NaN NaN K4   C4   D4
5 NaN NaN K5   C5   D5

(1) Check the whole DataFrame for missing values : df.isnull(), pd.isnull(df), df.notnull(), pd.notnull(df)

isnull() returns booleans: True where the observation is missing and False where it is not.

notnull() returns False where the observation is missing and True where it is not (the exact opposite of isnull()).

pd.isnull(DataFrame) and DataFrame.isnull() return the same result, and pd.notnull(DataFrame) and DataFrame.notnull() are likewise equivalent.

In [7]: pd.isnull(df_all)
Out[7]:
       A      B    KEY      C      D
0 False False False   True   True
1 False False False   True   True
2 False False False False False
3 False False False False False
4   True   True False False False
5   True   True False False False

In [8]: df_all.isnull()
Out[8]:
       A      B    KEY      C      D
0 False False False   True   True
1 False False False   True   True
2 False False False False False
3 False False False False False
4   True   True False False False
5   True   True False False False

In [9]: pd.notnull(df_all)
Out[9]:
       A      B   KEY      C      D
0   True   True True False False
1   True   True True False False
2   True   True True   True   True
3   True   True True   True   True
4 False False True   True   True
5 False False True   True   True

In [10]: df_all.notnull()
Out[10]:
       A      B   KEY      C      D
0   True   True True False False
1   True   True True False False
2   True   True True   True   True
3   True   True True   True   True
4 False False True   True   True
5 False False True   True   True

(2) Add missing values to specific columns and elements, then check for them : indexing & None

In the example below, None is assigned to the observations at index [0, 1] of columns ['A', 'B'] in 'df_all' to create missing values.

Column 'A' holds strings, so the assigned None is stored as None. Column 'B' holds floats, so the assigned None is automatically converted to NaN.

To index a DataFrame by row and column labels, use DataFrame.loc[[row1, row2], ['col1', 'col2']], as in the example below. (The older DataFrame.ix indexer has been removed from pandas.)

In [11]: df_all
Out[11]:
     A    B KEY    C    D
0   A0  0.5  K0  NaN  NaN
1   A1  2.2  K1  NaN  NaN
2   A2  3.6  K2   C2   D2
3   A3  0.4  K3   C3   D3
4  NaN  NaN  K4   C4   D4
5  NaN  NaN  K5   C5   D5

In [12]: df_all.loc[[0, 1], ['A', 'B']] = None

In [13]: df_all
Out[13]:
      A    B KEY    C    D
0  None  NaN  K0  NaN  NaN
1  None  NaN  K1  NaN  NaN
2    A2  3.6  K2   C2   D2
3    A3  0.4  K3   C3   D3
4   NaN  NaN  K4   C4   D4
5   NaN  NaN  K5   C5   D5

In [14]: df_all[['A', 'B']].isnull()
Out[14]:
       A      B
0   True   True
1   True   True
2 False False
3 False False
4   True   True
5   True   True

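(3) Count missing values per column : df.isnull().sum()

The counts shown in this section match df_all as it was first created, that is, before the None assignment in In [12].

# counting missing value numbers for all columns
In [15]: df_all.isnull().sum()
Out[15]: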
A      2
B      2
KEY    0
C      2
D      2
dtype: int64

# counting missing value numbers for 'A' column
In [16]: df_all['A'].isnull().sum()
Out[16]: 2

Conversely, to count the non-missing values per column, use df.notnull().sum().

# counting notnull value numbers for all columns
In [17]: df_all.notnull().sum()
Out[17]:
A      4
B      4
KEY    6
C      4
D      4
dtype: int64

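(4) Count missing values per row : df.isnull().sum(1)
Count non-missing values per row : df.notnull().sum(1)

Note that in the example below 'NotNull_cnt' is computed after the 'NaN_cnt' column has been added, so it counts that new column as well.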

In [18]: df_all
Out[18]:
     A    B KEY    C    D
0   A0 0.5 K0 NaN NaN
1   A1 2.2 K1 NaN NaN
2   A2 3.6 K2   C2   D2
3   A3 0.4 K3   C3   D3
4 NaN NaN K4   C4   D4
5 NaN NaN K5   C5   D5

In [19]: df_all['NaN_cnt'] = df_all.isnull().sum(1)

In [20]: df_all['NotNull_cnt'] = df_all.notnull().sum(1)

In [21]: df_all
Out[21]:
     A    B KEY    C    D  NaN_cnt  NotNull_cnt
0   A0  0.5  K0  NaN  NaN        2            4
1   A1  2.2  K1  NaN  NaN        2            4
2   A2  3.6  K2   C2   D2        0            6
3   A3  0.4  K3   C3   D3        0            6
4  NaN  NaN  K4   C4   D4        2            4
5  NaN  NaN  K5   C5   D5        2            4
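Beyond raw counts, two closely related checks are often useful: the share of missing values per column, and the rows that contain at least one missing value. A minimal sketch on a small made-up frame df_demo:

```python
import numpy as np
import pandas as pd

# Illustrative data; None and np.nan are both treated as missing.
df_demo = pd.DataFrame({'A': ['A0', None, 'A2', None],
                        'B': [0.5, np.nan, 3.6, 0.4]})

missing_ratio = df_demo.isnull().mean()                     # share of missing values per column
rows_with_missing = df_demo[df_demo.isnull().any(axis=1)]   # rows with at least one missing value
total_missing = int(df_demo.isnull().sum().sum())           # missing cells in the whole frame

print(missing_ratio)
print(rows_with_missing)
print(total_missing)
```

isnull().mean() works because True counts as 1 and False as 0, so the mean of the boolean mask is the missing ratio.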

The next post covers calculations with missing data.

[Python pandas] Combining DataFrames on their index (merge, join on index)

Python Analysis and Programming / Python Data Preprocessing, 2016. 12. 6. 23:47

The previous post showed how to combine DataFrames on a key column with the pandas merge() function.

This post shows how to combine DataFrames on their index with the pandas merge() and join() functions.

If you come from SQL or R, working with an index may feel a little unfamiliar.

First, import the required library and build two simple DataFrames as examples.

In [1]: import pandas as pd

In [2]: from pandas import DataFrame

In [3]: df_left = DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],

...: 'B': ['B0', 'B1', 'B2', 'B3']},

...: index=['K0', 'K1', 'K2', 'K3'])

In [4]: df_right = DataFrame({'C': ['C2', 'C3', 'C4', 'C5'],

...: 'D': ['D2', 'D3', 'D4', 'D5']},

...: index=['K2', 'K3', 'K4', 'K5'])

...:

In [5]: df_left

Out[5]:

A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 B3

In [6]: df_right

Out[6]:

C D
K2 C2 D2
K3 C3 D3
K4 C4 D4
K5 C5 D5

There are two ways to combine DataFrames on their index, pd.merge() and join(). join() is the more concise of the two, while pd.merge() spells out the join criteria and is therefore easier to read.

(1) Left join on index

# Left joining on index

# way 1 : by merge()

In [7]: pd.merge(df_left, df_right,

...: left_index=True, right_index=True,

...: how='left')

Out[7]:

     A   B    C    D
K0 A0 B0 NaN NaN
K1 A1 B1 NaN NaN
K2 A2 B2   C2   D2
K3 A3 B3   C3   D3

# way 2 : by join

In [8]: df_left.join(df_right, how='left')

Out[8]:

     A   B    C    D
K0 A0 B0 NaN NaN
K1 A1 B1 NaN NaN
K2 A2 B2   C2   D2
K3 A3 B3   C3   D3

(2) Right join on index

# Right join on index

# way 1 : merge()

In [9]: pd.merge(df_left, df_right,

...: left_index=True, right_index=True,

...: how='right')

Out[9]:

      A    B   C   D
K2   A2   B2 C2 D2
K3   A3   B3 C3 D3
K4 NaN NaN C4 D4
K5 NaN NaN C5 D5

# way 2 : join()

In [10]: df_left.join(df_right, how='right')

Out[10]:

      A    B   C   D
K2   A2   B2 C2 D2
K3   A3   B3 C3 D3
K4 NaN NaN C4 D4
K5 NaN NaN C5 D5

(3) Inner join on index

# inner join on index

# way 1 : by merge()

In [11]: pd.merge(df_left, df_right,

...: left_index=True, right_index=True,

...: how='inner')

Out[11]:

A B C D
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3

# way 2 : by join()

In [12]: df_left.join(df_right, how='inner')

Out[12]:

A B C D
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3

(4) Outer join on index

# outer join on index

# way 1 : by pd.merge()

In [13]: pd.merge(df_left, df_right,

...: left_index=True, right_index=True,

...: how='outer')

Out[13]:

      A    B    C    D
K0   A0   B0 NaN NaN
K1   A1   B1 NaN NaN
K2   A2   B2   C2   D2
K3   A3   B3   C3   D3
K4 NaN NaN   C4   D4
K5 NaN NaN   C5   D5

# way 2 : by join()

In [14]: df_left.join(df_right, how='outer')

Out[14]:

      A    B    C    D
K0   A0   B0 NaN NaN
K1   A1   B1 NaN NaN
K2   A2   B2   C2   D2
K3   A3   B3   C3   D3
K4 NaN NaN   C4   D4
K5 NaN NaN   C5   D5
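A quick way to keep the four join types apart is to compare how many rows each one returns. The sketch below rebuilds df_left and df_right from above and counts the rows per 'how' option; the dict name row_counts is made up for this illustration.

```python
import pandas as pd

df_left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']},
                       index=['K0', 'K1', 'K2', 'K3'])
df_right = pd.DataFrame({'C': ['C2', 'C3', 'C4', 'C5'],
                         'D': ['D2', 'D3', 'D4', 'D5']},
                        index=['K2', 'K3', 'K4', 'K5'])

# Number of result rows for each join type on the index.
row_counts = {how: len(df_left.join(df_right, how=how))
              for how in ['left', 'right', 'inner', 'outer']}

print(row_counts)
```

Only K2 and K3 appear in both indexes, so the inner join keeps 2 rows, left and right keep 4 each, and the outer join keeps all 6 distinct labels.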

The four examples above all joined on the index of both DataFrames.

But what if one DataFrame should be matched on its index and the other on a key column?

Both pd.merge() and join() can do this; only the how='left' case is shown here. Again, join() is the more concise, whereas pd.merge() states the join criteria explicitly and so reads more clearly. Which one to use is a matter of personal preference.

(5) Combine DataFrames on a key column and an index (Joining key columns on an index)

First build df_left_2 with a 'KEY' column and df_right_2 with an index, then combine the two by matching 'KEY' against that index.

# making DataFrame

In [15]: df_left_2 = DataFrame({'KEY': ['K0', 'K1', 'K2', 'K3'],

...: 'A': ['A0', 'A1', 'A2', 'A3'],

...: 'B': ['B0', 'B1', 'B2', 'B3']})

...:

In [16]: df_right_2 = DataFrame({'C': ['C2', 'C3', 'C4', 'C5'],

...: 'D': ['D2', 'D3', 'D4', 'D5']},

...: index=['K2', 'K3', 'K4', 'K5'])

...:

In [17]: df_left_2 # with 'KEY'

Out[17]:

A B KEY
0 A0 B0 K0
1 A1 B1 K1
2 A2 B2 K2
3 A3 B3 K3

In [18]: df_right_2 # with 'index'

Out[18]:

C D
K2 C2 D2
K3 C3 D3
K4 C4 D4
K5 C5 D5

Now combine the two DataFrames, using the key column on one side and the index on the other.

# joining key columns on an index

# way 1 : pd.merge()

In [19]: pd.merge(df_left_2, df_right_2,

...: left_on='KEY', right_index=True,

...: how='left')

...:

Out[19]:

    A   B KEY    C    D
0 A0 B0 K0 NaN NaN
1 A1 B1 K1 NaN NaN
2 A2 B2 K2   C2   D2
3 A3 B3 K3   C3   D3

# way 2 : join()

In [20]: df_left_2.join(df_right_2, on='KEY', how='left')

Out[20]:

    A   B KEY    C    D
0 A0 B0 K0 NaN NaN
1 A1 B1 K1 NaN NaN
2 A2 B2 K2   C2   D2
3 A3 B3 K3   C3   D3
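Since join(on=...) and pd.merge(left_on=..., right_index=True) are two spellings of the same operation, it is easy to confirm that they agree. A small sketch that rebuilds the two frames and compares the results with DataFrame.equals() (the variable names by_merge and by_join are made up here):

```python
import pandas as pd

df_left_2 = pd.DataFrame({'KEY': ['K0', 'K1', 'K2', 'K3'],
                          'A': ['A0', 'A1', 'A2', 'A3'],
                          'B': ['B0', 'B1', 'B2', 'B3']})
df_right_2 = pd.DataFrame({'C': ['C2', 'C3', 'C4', 'C5'],
                           'D': ['D2', 'D3', 'D4', 'D5']},
                          index=['K2', 'K3', 'K4', 'K5'])

# Match the 'KEY' column of the left frame against the index of the right frame.
by_merge = pd.merge(df_left_2, df_right_2,
                    left_on='KEY', right_index=True, how='left')
by_join = df_left_2.join(df_right_2, on='KEY', how='left')

print(by_merge.equals(by_join))
```

In both results, rows K0 and K1 have NaN in 'C' and 'D' because those keys do not exist in the index of df_right_2.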

That concludes this introduction to combining DataFrames on their index.

[Python pandas] Joining/merging DataFrames like database tables : pd.merge()

Python Analysis and Programming / Python Data Preprocessing, 2016. 12. 3. 23:31

In data analysis you very often need to combine data scattered across several places on a specific key before you can analyze it.

The previous post showed how to combine DataFrames with the pandas concat() function and with the append() function.

This post shows how to use pd.merge() to combine DataFrames on a key with inner, outer, left, and right joins, much like joining tables in a database with SQL.

Analysts who are comfortable with SQL should pick this up quickly. pandas merge() also works in memory and runs quite fast, which makes it convenient to use.

pd.merge() takes a number of parameters, listed below. Of these, be sure to remember 'how' and 'on'.

pd.merge(left, right, # the DataFrame objects to merge
             how='inner', # left, right, inner (default), outer
             on=None, # key column(s) to merge on
             left_on=None, # column(s) of the left DataFrame to use as the key
             right_on=None, # column(s) of the right DataFrame to use as the key
             left_index=False, # if True, use the index of the left DataFrame as the merge key
             right_index=False, # if True, use the index of the right DataFrame as the merge key
             sort=False, # if True, sort the merged DataFrame by the join key (the function's default is False)
             suffixes=('_x', '_y'), # suffixes appended to overlapping column names (defaults to '_x', '_y')
             copy=True, # copy the data of the DataFrames being merged
             indicator=False) # if True, add a column recording the source of each row: left_only, right_only, or both

First import pandas and DataFrame, then build two DataFrames.

In [1]: import pandas as pd

In [2]: from pandas import DataFrame

In [3]: df_left = DataFrame({'KEY': ['K0', 'K1', 'K2', 'K3'],

...: 'A': ['A0', 'A1', 'A2', 'A3'],

...: 'B': ['B0', 'B1', 'B2', 'B3']})

...:

In [4]: df_right = DataFrame({'KEY': ['K2', 'K3', 'K4', 'K5'],

...: 'C': ['C2', 'C3', 'C4', 'C5'],

...: 'D': ['D2', 'D3', 'D4', 'D5']})

...:

In [5]: df_left

Out[5]:

A B KEY
0 A0 B0 K0
1 A1 B1 K1
2 A2 B2 K2
3 A3 B3 K3

In [6]: df_right

Out[6]:

C D KEY
0 C2 D2 K2
1 C3 D3 K3
2 C4 D4 K4
3 C5 D5 K5

Now merge the two DataFrames built above, 'df_left' and 'df_right', on the 'KEY' column with each of the 'how' options: left, right, inner, and outer. Anyone familiar with SQL joins will find this easy to follow.

(1) Merge method : left (SQL join name : LEFT OUTER JOIN)

In [7]: df_merge_how_left = pd.merge(df_left, df_right,

...: how='left',

...: on='KEY')

...:

In [8]: df_merge_how_left

Out[8]:

    A   B KEY   C     D
0 A0 B0 K0 NaN NaN
1 A1 B1 K1 NaN NaN
2 A2 B2 K2   C2   D2
3 A3 B3 K3   C3   D3

(2) Merge method : right (SQL join name : RIGHT OUTER JOIN)

In [9]: df_merge_how_right = pd.merge(df_left, df_right,

...: how='right',

...: on='KEY')

In [10]: df_merge_how_right

Out[10]:

     A    B KEY   C   D
0   A2   B2 K2 C2 D2
1   A3   B3 K3 C3 D3
2 NaN NaN K4 C4 D4
3 NaN NaN K5 C5 D5

(3) Merge method : inner (SQL join name : INNER JOIN)

In [11]: df_merge_how_inner = pd.merge(df_left, df_right,

...: how='inner', # default

...: on='KEY')

...:

In [12]: df_merge_how_inner

Out[12]:

A B KEY C D
0 A2 B2 K2 C2 D2
1 A3 B3 K3 C3 D3

(4) Merge method : outer (SQL join name : FULL OUTER JOIN)

In [13]: df_merge_how_outer = pd.merge(df_left, df_right,

...: how='outer',

...: on='KEY')

...:

In [14]: df_merge_how_outer

Out[14]:

     A    B KEY    C    D
0   A0   B0 K0 NaN NaN
1   A1   B1 K1 NaN NaN
2   A2   B2 K2   C2   D2
3   A3   B3 K3   C3   D3
4 NaN NaN K4   C4   D4
5 NaN NaN K5   C5   D5

[Reference] Hive join statements : INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL JOIN, CARTESIAN PRODUCT JOIN, MAP-SIDE JOIN, UNION ALL : http://rfriend.tistory.com/216

(5) indicator = True : add a column to the merged DataFrame that records the source of each row (left_only, right_only, both)

In [15]: pd.merge(df_left, df_right, how='outer', on='KEY',

...: indicator=True)

Out[15]:

     A    B KEY    C    D      _merge
0   A0   B0 K0 NaN NaN   left_only
1   A1   B1 K1 NaN NaN   left_only
2   A2   B2 K2   C2   D2        both
3   A3   B3 K3   C3   D3        both
4 NaN NaN K4   C4   D4 right_only
5 NaN NaN K5   C5   D5 right_only

With indicator=True, a new column named '_merge' was added.

Alternatively, pass a string such as indicator='indicator_info', and the indicator information is returned in a column with that name, as shown below.

In [16]: pd.merge(df_left, df_right, how='outer', on='KEY',

...: indicator='indicator_info')

Out[16]:

     A    B KEY    C    D indicator_info
0   A0   B0 K0 NaN NaN      left_only
1   A1   B1 K1 NaN NaN      left_only
2   A2   B2 K2   C2   D2           both
3   A3   B3 K3   C3   D3           both
4 NaN NaN K4   C4   D4     right_only
5 NaN NaN K5   C5   D5     right_only
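One practical use of the indicator column is an "anti-join": keeping only the rows of the left DataFrame that found no match on the right. A minimal sketch with cut-down versions of df_left and df_right (the frame name left_only is made up for this illustration):

```python
import pandas as pd

df_left = pd.DataFrame({'KEY': ['K0', 'K1', 'K2', 'K3'],
                        'A': ['A0', 'A1', 'A2', 'A3']})
df_right = pd.DataFrame({'KEY': ['K2', 'K3', 'K4', 'K5'],
                         'C': ['C2', 'C3', 'C4', 'C5']})

merged = pd.merge(df_left, df_right, how='outer', on='KEY', indicator=True)

# Keep rows that exist only in df_left, then drop the helper column.
left_only = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

print(left_only)
```

Filtering on 'right_only' or 'both' works the same way, so one outer merge is enough to split the keys into all three groups.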

(6) Add suffixes when column names overlap : suffixes = ('_x', '_y')

Build two DataFrames that both contain columns named 'B' and 'C', then merge them on KEY. Because the column names overlap, use suffixes = ('string', 'string') to append a suffix to each overlapping name so that the data source can be told apart. The default is suffixes = ('_x', '_y').

# making DataFrames with overlapping columns

In [17]: df_left_2 = DataFrame({'KEY': ['K0', 'K1', 'K2', 'K3'],

...: 'A': ['A0', 'A1', 'A2', 'A3'],

...: 'B': ['B0', 'B1', 'B2', 'B3'],

...: 'C': ['C0', 'C1', 'C2', 'C3']})

In [18]: df_right_2 = DataFrame({'KEY': ['K0', 'K1', 'K2', 'K3'],

...: 'B': ['B0_2', 'B1_2', 'B2_2', 'B3_2'],

...: 'C': ['C0_2', 'C1_2', 'C2_2', 'C3_2'],

...: 'D': ['D0_2', 'D1_2', 'D2_2', 'D3_3']})

...:

In [19]: df_left_2

Out[19]:

A B C KEY
0 A0 B0 C0 K0
1 A1 B1 C1 K1
2 A2 B2 C2 K2
3 A3 B3 C3 K3

In [20]: df_right_2

Out[20]:

B C D KEY
0 B0_2 C0_2 D0_2 K0
1 B1_2 C1_2 D1_2 K1
2 B2_2 C2_2 D2_2 K2
3 B3_2 C3_2 D3_3 K3

# adding string suffixes to apply to overlapping columns

In [21]: pd.merge(df_left_2, df_right_2, how='inner', on='KEY',

...: suffixes=('_left', '_right'))

...:

Out[21]:

    A B_left C_left KEY B_right C_right     D
0 A0     B0     C0 K0    B0_2    C0_2 D0_2
1 A1     B1     C1 K1    B1_2    C1_2 D1_2
2 A2     B2     C2 K2    B2_2    C2_2 D2_2
3 A3     B3     C3 K3    B3_2    C3_2 D3_3

# suffixes defaults to ('_x', '_y')

In [22]: pd.merge(df_left_2, df_right_2, how='inner', on='KEY')

...:

Out[22]:

A B_x C_x KEY B_y C_y D
0 A0 B0 C0 K0 B0_2 C0_2 D0_2
1 A1 B1 C1 K1 B1_2 C1_2 D1_2
2 A2 B2 C2 K2 B2_2 C2_2 D2_2
3 A3 B3 C3 K3 B3_2 C3_2 D3_3

left_on, right_on, left_index, and right_index are covered in the next post.

[Python pandas] Combining a DataFrame and a Series : pd.concat(), append()

Python Analysis and Programming / Python Data Preprocessing, 2016. 11. 30. 23:46

The previous post used the pd.concat() function of the pandas library to stack DataFrames vertically and side by side.

This post continues with combining a DataFrame and a Series using pd.concat() and append().

Compared with combining DataFrames, DataFrame + Series can be a little confusing where the index is concerned, but the simple examples below should make it easy to follow.

Let's start by importing pandas, DataFrame, and Series.

# importing libraries

In [1]: import pandas as pd

...: from pandas import DataFrame

...: from pandas import Series

(1) Combine a DataFrame and a Series side by side : pd.concat([df, Series], axis=1)

Combining a DataFrame with a Series yields a DataFrame. With axis=1, the Series is added as a new column on the right.

Look closely at the column names of the combined DataFrame: the name of the Series becomes the name of the new column.

In [2]: df_1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],

...: 'B': ['B0', 'B1', 'B2'],

...: 'C': ['C0', 'C1', 'C2'],

...: 'D': ['D0', 'D1', 'D2']},

...: index=[0, 1, 2])

In [3]: df_1

Out[3]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2

In [4]: Series_1 = pd.Series(['S1', 'S2', 'S3'], name='S')

In [5]: Series_1

Out[5]:

0    S1
1    S2
2    S3
Name: S, dtype: object

# Concatenating DataFrame and Series along columns (from left to right)

# concatenated column name of the new DataFrame will be the same name of Series

In [6]: pd.concat([df_1, Series_1], axis=1)

Out[6]:

A B C D S
0 A0 B0 C0 D0 S1
1 A1 B1 C1 D1 S2
2 A2 B2 C2 D2 S3

(2) When combining a DataFrame and a Series side by side,

ignore the column names and assign integer labels automatically : ignore_index=True

In [7]: pd.concat([df_1, Series_1], axis=1, ignore_index=True)

Out[7]:

0 1 2 3 4
0 A0 B0 C0 D0 S1
1 A1 B1 C1 D1 S2
2 A2 B2 C2 D2 S3

(3) Combine Series side by side : pd.concat([Series1, Series2, ...], axis=1)

If a Series has a name, it is used as the column name in the combined DataFrame. Series without a name are assigned the integers 0, 1, 2, ... automatically.

In [8]: Series_1 = pd.Series(['S1', 'S2', 'S3'], name='S')

In [9]: Series_2 = pd.Series([0, 1, 2]) # without name

In [10]: Series_3 = pd.Series([3, 4, 5]) # without name

In [11]: Series_1

Out[11]:

0    S1
1    S2
2    S3

Name: S, dtype: object

In [12]: Series_2

Out[12]:

0    0
1    1
2    2
dtype: int64

In [13]: Series_3

Out[13]:

0    3
1    4
2    5
dtype: int64

# name of Series will be used as the column name of concatenated DataFrame

In [14]: pd.concat([Series_1, Series_2, Series_3], axis=1)

Out[14]:

S 0 1
0 S1 0 3
1 S2 1 4
2 S3 2 5

(4) Overriding the column names when combining Series : keys = ['xx', 'xx', ...]

In [15]: pd.concat([Series_1, Series_2, Series_3], axis=1, keys=['C0', 'C1', 'C1'])

Out[15]:

   C0 C1 C1
0 S1   0   3
1 S2   1   4
2 S3   2   5

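(5) Combining a DataFrame and a Series top + bottom : df.append(Series, ignore_index=True)

Set ignore_index=True when doing this. (Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the session below requires an older pandas version.)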

In [16]: df_1

Out[16]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2

In [17]: Series_4 = pd.Series(['S1', 'S2', 'S3', 'S4'], index=['A', 'B', 'C', 'E'])

In [18]: Series_4

Out[18]:

A    S1
B    S2
C    S3
E    S4
dtype: object

In [19]: df_1.append(Series_4, ignore_index=True)

Out[19]:

    A   B   C    D    E
0  A0  B0  C0   D0  NaN
1  A1  B1  C1   D1  NaN
2  A2  B2  C2   D2  NaN
3  S1  S2  S3  NaN   S4

If ignore_index=True is not set (and the Series has no name), a 'TypeError' is raised, as shown below.

In [20]: df_1.append(Series_4) # TypeError without 'ignore_index=True'

Traceback (most recent call last):

File "<ipython-input-20-ca24d6ef8563>", line 1, in <module>

df_1.append(Series_4) # TypeError without 'ignore_index=True'

File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4314, in append

raise TypeError('Can only append a Series if ignore_index=True'

TypeError: Can only append a Series if ignore_index=True or if the Series has a name
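
On pandas 2.0 and later, where DataFrame.append() no longer exists, the same top + bottom result can be obtained with pd.concat(). The following is a minimal sketch under that assumption: it turns the Series into a one-row DataFrame first (the names row and appended are just illustrative).

```python
import pandas as pd

df_1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2'],
                     'C': ['C0', 'C1', 'C2'],
                     'D': ['D0', 'D1', 'D2']})
Series_4 = pd.Series(['S1', 'S2', 'S3', 'S4'], index=['A', 'B', 'C', 'E'])

# to_frame() makes a one-column DataFrame; .T flips it into a single row
# whose columns are the Series index labels (A, B, C, E)
row = Series_4.to_frame().T

# stack the row under df_1; columns missing on either side become NaN
appended = pd.concat([df_1, row], ignore_index=True)
print(appended)
```

As with append(), the column 'E' that exists only in the Series is added, and 'D' is NaN for the new row.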

Posted by Rfriend

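[Python pandas] Combining multiple DataFrames of the same shape : pd.concat()

Python Analysis and Programming / Python Data Preprocessing 2016. 11. 28. 23:47

During analysis you often need to gather several data tables scattered in different places and combine them, at least when nobody else is doing the data preprocessing for you.

Anywhere that manages a normalized database, the data will be split across several tables, with the data entities separated by subject.

In this post I introduce pd.concat(), the pandas function for concatenating DataFrames, which is especially useful for combining datasets whose attributes have the same form (homogeneously-typed objects).

(It is similar to rbind() and cbind() in R.)

The parameters of pd.concat() and their default settings are listed below. I will go through them one by one with examples.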

pd.concat(objs,                    # Series or DataFrame objects
          axis=0,                  # 0: top + bottom, 1: left + right
          join='outer',            # 'outer': union, 'inner': intersection
          # join_axes=None,        # no longer supported; formerly reused a specific DataFrame's index when axis=1
          ignore_index=False,      # False: keep the existing index, True: ignore it
          keys=None,               # sequence of keys, to build a hierarchical index
          levels=None,
          names=None,              # sequence of names, to name the index levels
          verify_integrity=False,  # True: check for duplicate index values
          copy=True)               # copy the data

(1-1) Combining DataFrames top + bottom (rbind) : axis = 0

# importing libraries

In [1]: import pandas as pd

...: from pandas import DataFrame

# making DataFrames

In [2]: df_1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],

...: 'B': ['B0', 'B1', 'B2'],

...: 'C': ['C0', 'C1', 'C2'],

...: 'D': ['D0', 'D1', 'D2']},

...: index=[0, 1, 2])

In [3]: df_2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],

...: 'B': ['B3', 'B4', 'B5'],

...: 'C': ['C3', 'C4', 'C5'],

...: 'D': ['D3', 'D4', 'D5']},

...: index=[3, 4, 5])

In [4]: df_1

Out[4]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2

In [5]: df_2

Out[5]:

A B C D
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5

# concatenating DataFrame1, 2 along rows, axis=0, default

In [6]: df_12_axis0 = pd.concat([df_1, df_2]) # row bind : axis = 0, default

In [7]: df_12_axis0

Out[7]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5

(1-2) Combining DataFrames left + right (cbind) : axis = 1

In [8]: df_3 = pd.DataFrame({'E': ['A6', 'A7', 'A8'],

...: 'F': ['B6', 'B7', 'B8'],

...: 'G': ['C6', 'C7', 'C8'],

...: 'H': ['D6', 'D7', 'D8']},

...: index=[0, 1, 2])

In [9]: df_1

Out[9]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2

In [10]: df_3

Out[10]:

E F G H
0 A6 B6 C6 D6
1 A7 B7 C7 D7
2 A8 B8 C8 D8

# concatenating DataFrames along columns, axis=1

In [11]: df_13_axis1 = pd.concat([df_1, df_3], axis=1) # column bind

In [12]: df_13_axis1

Out[12]:

A B C D E F G H
0 A0 B0 C0 D0 A6 B6 C6 D6
1 A1 B1 C1 D1 A7 B7 C7 D7
2 A2 B2 C2 D2 A8 B8 C8 D8

(2-1) Combining DataFrames as a union : join = 'outer'

In [13]: df_4 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],

...: 'B': ['B0', 'B1', 'B2'],

...: 'C': ['C0', 'C1', 'C2'],

...: 'E': ['E0', 'E1', 'E2']},

...: index=[0, 1, 3])

In [17]: df_1

Out[17]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2

In [18]: df_4

Out[18]:

A B C E
0 A0 B0 C0 E0
1 A1 B1 C1 E1
3 A2 B2 C2 E2

In [19]: df_14_outer = pd.concat([df_1, df_4], join='outer') # union, default

In [20]: df_14_outer

Out[20]:

    A   B   C    D    E
0  A0  B0  C0   D0  NaN
1  A1  B1  C1   D1  NaN
2  A2  B2  C2   D2  NaN
0  A0  B0  C0  NaN   E0
1  A1  B1  C1  NaN   E1
3  A2  B2  C2  NaN   E2

(2-2) Combining DataFrames as an intersection : join = 'inner'

In [21]: df_14_inner = pd.concat([df_1, df_4], join='inner') # intersection

In [22]: df_14_inner

Out[22]:

A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
0 A0 B0 C0
1 A1 B1 C1
3 A2 B2 C2

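(3) Reusing the index of a specific DataFrame when axis=1 : reindex() (formerly join_axes)

For axis=1 (left + right), three approaches are shown below: join='outer', join='inner', and reindex(df.index), which replaces the former join_axes=[df.index]. Compare the index of each combined DataFrame carefully.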

In [23]: df_1

Out[23]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2

In [24]: df_4

Out[24]:

A B C E
0 A0 B0 C0 E0
1 A1 B1 C1 E1
3 A2 B2 C2 E2

# comparison 1
In [25]: df_14_outer_axis1 = pd.concat([df_1, df_4], join='outer', axis=1) # default

In [26]: df_14_outer_axis1

Out[26]:

     A    B    C    D    A    B    C    E
0   A0   B0   C0   D0   A0   B0   C0   E0
1   A1   B1   C1   D1   A1   B1   C1   E1
2   A2   B2   C2   D2  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN   A2   B2   C2   E2

# comparison 2

In [29]: df_14_inner_axis1 = pd.concat([df_1, df_4], join='inner', axis=1)

In [30]: df_14_inner_axis1

Out[30]:

A B C D A B C E
0 A0 B0 C0 D0 A0 B0 C0 E0
1 A1 B1 C1 D1 A1 B1 C1 E1

# reuse the exact index from the original DataFrame : reindex()

In [31]: df_14_axis1_reindex = pd.concat([df_1, df_4], axis=1).reindex(df_1.index)

In [32]: df_14_axis1_reindex

Out[32]:

    A   B   C   D    A    B    C    E
0  A0  B0  C0  D0   A0   B0   C0   E0
1  A1  B1  C1  D1   A1   B1   C1   E1
2  A2  B2  C2  D2  NaN  NaN  NaN  NaN

* (Note) If you pass the join_axes parameter with a recent version of pandas, you will get the error message TypeError: concat() got an unexpected keyword argument 'join_axes', as shown below. This post was written in 2016, and some later changes to the pandas parameters had not been reflected in it. (Thanks to 김명찬 for the comment that let me correct the post.)

The join_axes parameter was deprecated and is no longer accepted. Instead, you can reuse the existing index with reindex(), as in the In [31] example above.

df_14_join_axes_axis1 = pd.concat([df_1, df_4], join_axes=[df_1.index], axis=1)

------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-20-748c1e0a3504> in <module>

----> 1 df_14_join_axes_axis1 = pd.concat([df_1, df_4], join_axes=[df_1.index], axis=1)

TypeError: concat() got an unexpected keyword argument 'join_axes'

(4) Ignoring the existing index : ignore_index

In [33]: df_5 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],

...: 'B': ['B0', 'B1', 'B2'],

...: 'C': ['C0', 'C1', 'C2'],

...: 'D': ['D0', 'D1', 'D2']},

...: index=['r0', 'r1', 'r2'])

In [34]: df_6 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],

...: 'B': ['B3', 'B4', 'B5'],

...: 'C': ['C3', 'C4', 'C5'],

...: 'D': ['D3', 'D4', 'D5']},

...: index=['r3', 'r4', 'r5'])

In [35]: df_56_with_index = pd.concat([df_5, df_6], ignore_index=False) # default

In [36]: df_56_with_index

Out[36]:

A B C D
r0 A0 B0 C0 D0
r1 A1 B1 C1 D1
r2 A2 B2 C2 D2
r3 A3 B3 C3 D3
r4 A4 B4 C4 D4
r5 A5 B5 C5 D5

# if you want ignore current index, use 'ignore_index=True'

In [37]: df_56_ignore_index = pd.concat([df_5, df_6], ignore_index=True)# index 0~(n-1)

In [38]: df_56_ignore_index

Out[38]:

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5

(5) Building a hierarchical index : keys

# concatenating DataFrames : Construct hierarchical index using 'keys'

In [40]: df_56_with_keys = pd.concat([df_5, df_6], keys=['df_5', 'df_6'])

In [41]: df_56_with_keys

Out[41]:

          A   B   C   D
df_5 r0  A0  B0  C0  D0
     r1  A1  B1  C1  D1
     r2  A2  B2  C2  D2
df_6 r3  A3  B3  C3  D3
     r4  A4  B4  C4  D4
     r5  A5  B5  C5  D5

For reference, here is how to index with a hierarchical index. The index of the 'df_56_with_keys' DataFrame has two levels, so you can index with the first-level label and then, separately, with the second level. See the examples below.

In [42]: df_56_with_keys.loc['df_5']

Out[42]:

A B C D
r0 A0 B0 C0 D0
r1 A1 B1 C1 D1
r2 A2 B2 C2 D2

In [43]: df_56_with_keys.loc['df_5'][0:2]

Out[43]:

A B C D
r0 A0 B0 C0 D0
r1 A1 B1 C1 D1
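
Besides chaining two selections as above, both levels can be addressed in one step. The sketch below uses smaller two-column frames for brevity (stacked, one_row, and r4_rows are illustrative names, not from the original post) and shows a tuple passed to .loc as well as xs() for selecting on the second level only.

```python
import pandas as pd

df_5 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']},
                    index=['r0', 'r1', 'r2'])
df_6 = pd.DataFrame({'A': ['A3', 'A4', 'A5'], 'B': ['B3', 'B4', 'B5']},
                    index=['r3', 'r4', 'r5'])
stacked = pd.concat([df_5, df_6], keys=['df_5', 'df_6'])

# a tuple addresses both levels at once and returns one row as a Series
one_row = stacked.loc[('df_5', 'r1')]
print(one_row)

# xs() selects on the second level without naming the first level
r4_rows = stacked.xs('r4', level=1)
print(r4_rows)
```

The tuple form avoids the intermediate DataFrame created by chained indexing, which also makes it the safer choice when assigning values.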

(6) Naming the index levels : names

In [44]: df_56_with_name = pd.concat([df_5, df_6],

...: keys=['df_5', 'df_6'],

...: names=['df_name', 'row_number'])

In [45]: df_56_with_name

Out[45]:

                     A   B   C   D
df_name row_number
df_5    r0          A0  B0  C0  D0
        r1          A1  B1  C1  D1
        r2          A2  B2  C2  D2
df_6    r3          A3  B3  C3  D3
        r4          A4  B4  C4  D4
        r5          A5  B5  C5  D5

(7) Checking for duplicate index values : verify_integrity

Let's build df_7 and df_8 so that both contain the index label 'r2', and then apply pd.concat(). With verify_integrity=False (the default, so it need not be passed explicitly), the DataFrames are stacked top + bottom without any error message, and the 'r2' label appears twice. With verify_integrity=True, by contrast, any duplicate index value raises 'ValueError: Indexes have overlapping values: xxx' and the concatenation is not performed at all.

In [48]: df_7 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],

...: 'B': ['B0', 'B1', 'B2'],

...: 'C': ['C0', 'C1', 'C2'],

...: 'D': ['D0', 'D1', 'D2']},

...: index=['r0', 'r1', 'r2'])

...:

In [49]: df_8 = pd.DataFrame({'A': ['A2', 'A3', 'A4'],

...: 'B': ['B2', 'B3', 'B4'],

...: 'C': ['C2', 'C3', 'C4'],

...: 'D': ['D2', 'D3', 'D4']},

...: index=['r2', 'r3', 'r4'])

In [50]: df_7

Out[50]:

A B C D
r0 A0 B0 C0 D0
r1 A1 B1 C1 D1
r2 A2 B2 C2 D2

In [51]: df_8

Out[51]:

A B C D
r2 A2 B2 C2 D2
r3 A3 B3 C3 D3
r4 A4 B4 C4 D4

# concatenating DataFrames without overlap checking : verify_integrity=False

In [52]: df_78_F_verify_integrity = pd.concat([df_7, df_8],

...: verify_integrity=False) # default

In [53]: df_78_F_verify_integrity

Out[53]:

A B C D
r0 A0 B0 C0 D0
r1 A1 B1 C1 D1
r2 A2 B2 C2 D2
r2 A2 B2 C2 D2
r3 A3 B3 C3 D3
r4 A4 B4 C4 D4

# index overlap checking, using verify_integrity=True

In [54]: df_78_T_verify_integrity = pd.concat([df_7, df_8],

...: verify_integrity=True)

Traceback (most recent call last):

File "<ipython-input-56-5512ad3b5016>", line 2, in <module>
verify_integrity=True)

File "C:\Anaconda3\lib\site-packages\pandas\tools\merge.py", line 845, in concat
copy=copy)

File "C:\Anaconda3\lib\site-packages\pandas\tools\merge.py", line 984, in __init__
self.new_axes = self._get_new_axes()

File "C:\Anaconda3\lib\site-packages\pandas\tools\merge.py", line 1073, in _get_new_axes
new_axes[self.axis] = self._get_concat_axis()

File "C:\Anaconda3\lib\site-packages\pandas\tools\merge.py", line 1132, in _get_concat_axis
self._maybe_check_integrity(concat_axis)

File "C:\Anaconda3\lib\site-packages\pandas\tools\merge.py", line 1141, in _maybe_check_integrity
% str(overlap))

ValueError: Indexes have overlapping values: ['r2']
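
In a script, the ValueError can be caught and handled instead of stopping the run. The following is one possible sketch, not the only way to do it: it inspects the overlap with Index.intersection() and, as an assumed policy for illustration, keeps the first occurrence of each duplicated label.

```python
import pandas as pd

df_7 = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['r0', 'r1', 'r2'])
df_8 = pd.DataFrame({'A': ['A2', 'A3', 'A4']}, index=['r2', 'r3', 'r4'])

# find the overlapping labels up front
overlap = df_7.index.intersection(df_8.index)
print(list(overlap))

try:
    combined = pd.concat([df_7, df_8], verify_integrity=True)
except ValueError:
    # assumed fallback policy: concatenate anyway,
    # then keep only the first row for each duplicated label
    combined = pd.concat([df_7, df_8])
    combined = combined[~combined.index.duplicated(keep='first')]

print(combined)
```

Whether to keep the first row, keep the last row, or abort depends on the data, so the fallback branch should be adapted to the case at hand.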

I hope this was helpful.

Posted by Rfriend

[Python pandas] Resetting a DataFrame's index (reindex) and filling in missing values

Python Analysis and Programming / Python Data Preprocessing 2016. 11. 27. 21:39

In the previous post we looked at selecting data from a DataFrame by rows and columns (indexing and selection).

Once an index has been created, you may later need to modify it.

This post covers how to

- (1) reset the index (reindex)

- (2) fill in the missing values that arise during reindexing

First, let's import the required libraries and build a simple DataFrame with 5 rows and 2 columns from a dict and an index.

##-- Import libraries and make a sample DataFrame

In [1]: import numpy as np

...: import pandas as pd

...: from pandas import DataFrame

In [2]: idx = ['r0', 'r1', 'r2', 'r3', 'r4']

...:

...: df_1 = pd.DataFrame({

...: 'c1': np.arange(5),

...: 'c2': np.random.randn(5)},

...: index=idx)

In [3]: df_1

Out[3]:

    c1        c2
r0   0 1.182716
r1   1 0.244398
r2   2 -1.494202
r3   3 0.146152
r4   4 -0.352680

In the example above, the row index of df_1 is ['r0', 'r1', 'r2', 'r3', 'r4']. Suppose we want to drop ['r3', 'r4'] and add ['r5', 'r6']. That is what 'reindex' is for.

(1-1) Resetting the index : reindex

##-- Make a new index and reindex the dataframe

In [4]: new_idx= ['r0', 'r1', 'r2', 'r5', 'r6']

In [5]: df_1.reindex(new_idx)

Out[5]:

     c1        c2
r0 0.0 1.182716
r1 1.0 0.244398
r2 2.0 -1.494202
r5 NaN       NaN
r6 NaN       NaN

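Because ['r5', 'r6'] did not exist in the original index, their rows were filled with 'NaN' by default (and column c1 was converted to float so that it can hold NaN). Using the fill_value parameter, let's fill them with '0', 'missing', or 'NA' instead of 'NaN'.

(1-2) Filling in the missing values created by reindexing : fill_value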

##-- Fill in the missing values by passing a value to the keyword fill_value

In [8]: df_1.reindex(new_idx, fill_value=0)

Out[8]:

    c1        c2
r0   0 1.182716
r1   1 0.244398
r2   2 -1.494202
r5   0 0.000000
r6   0 0.000000

In [9]: df_1.reindex(new_idx, fill_value='missing')

Out[9]:

         c1        c2
r0        0   1.18272
r1        1 0.244398
r2        2   -1.4942
r5 missing   missing
r6 missing   missing

In [10]: df_1.reindex(new_idx, fill_value='NA')
Out[10]:

    c1        c2
r0   0   1.18272
r1   1 0.244398
r2   2   -1.4942
r5 NA        NA
r6 NA        NA
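
One side effect worth noticing in the outputs above is that a string fill_value changes how the columns are stored. The sketch below, which uses only the c1 column for brevity (filled_num and filled_str are illustrative names), compares the resulting dtypes.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'c1': np.arange(5)},
                    index=['r0', 'r1', 'r2', 'r3', 'r4'])
new_idx = ['r0', 'r1', 'r2', 'r5', 'r6']

# a numeric fill value keeps the integer column as integers
filled_num = df_1.reindex(new_idx, fill_value=0)
print(filled_num['c1'].dtype)

# a string fill value forces the column to the generic object dtype,
# so numeric operations on it will no longer work as expected
filled_str = df_1.reindex(new_idx, fill_value='missing')
print(filled_str['c1'].dtype)
```

For columns that will be used in calculations later, a numeric fill value (or leaving NaN in place) is usually the safer choice.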

For time series data, the index of a DataFrame can be built with pd.date_range(date, periods, freq). (Processing and analysis of time series data will be covered in separate posts later.)

First, let's create a DataFrame of time series data.

In [11]: date_idx = pd.date_range('11/27/2016', periods=5, freq='D')

In [12]: date_idx

Out[12]:

DatetimeIndex(['2016-11-27', '2016-11-28', '2016-11-29', '2016-11-30',

'2016-12-01'],

dtype='datetime64[ns]', freq='D')

In [13]: df_2 = pd.DataFrame({"c1": [10, 20, 30, 40, 50]}, index=date_idx)

In [14]: df_2

Out[14]:

            c1
2016-11-27  10
2016-11-28  20
2016-11-29  30
2016-11-30  40
2016-12-01  50

Now let's use reindex to add a few new dates before and after the dates of the time series DataFrame created above.

(2-1) Reindexing time series data

In [15]: date_idx_2 = pd.date_range('11/25/2016', periods=10, freq='D')

In [16]: df_2.reindex(date_idx_2)

Out[16]:

              c1
2016-11-25   NaN
2016-11-26   NaN
2016-11-27  10.0
2016-11-28  20.0
2016-11-29  30.0
2016-11-30  40.0
2016-12-01  50.0
2016-12-02   NaN
2016-12-03   NaN
2016-12-04   NaN

(2-2) Filling in the missing values created by reindexing time series data : method='ffill', 'bfill'

With method='ffill', reindex fills each missing value with the last valid value that precedes it.

In [17]: df_2.reindex(date_idx_2, method='ffill') # forward-propagation

Out[17]:

                   c1
2016-11-25   NaN
2016-11-26   NaN
2016-11-27 10.0
2016-11-28 20.0
2016-11-29 30.0
2016-11-30 40.0
2016-12-01 50.0
2016-12-02 50.0
2016-12-03 50.0
2016-12-04 50.0

This time, with method='bfill', reindex fills each missing value with the next valid value that follows it.

In [18]: df_2.reindex(date_idx_2, method='bfill') # backward fill

Out[18]:

                  c1
2016-11-25 10.0
2016-11-26 10.0
2016-11-27 10.0
2016-11-28 20.0
2016-11-29 30.0
2016-11-30 40.0
2016-12-01 50.0
2016-12-02   NaN
2016-12-03   NaN
2016-12-04   NaN
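
Carrying the last value forward indefinitely is not always appropriate. reindex() also accepts a limit parameter that caps how many consecutive new labels are filled. The sketch below assumes a cap of one day, chosen purely for illustration.

```python
import pandas as pd

date_idx = pd.date_range('2016-11-27', periods=5, freq='D')
df_2 = pd.DataFrame({'c1': [10, 20, 30, 40, 50]}, index=date_idx)
date_idx_2 = pd.date_range('2016-11-25', periods=10, freq='D')

# forward-fill at most one consecutive new date;
# dates further past the last observation stay NaN
ffill_1 = df_2.reindex(date_idx_2, method='ffill', limit=1)
print(ffill_1)
```

Here only 2016-12-02 receives the value 50.0. The two dates before the first observation are still NaN, because a forward fill has nothing to carry into them.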

I hope this was helpful.

Posted by Rfriend

[Python pandas] Selecting rows or columns from a DataFrame (DataFrame objects indexing and selection)

Python Analysis and Programming / Python Data Preprocessing 2016. 11. 27. 16:37

In the previous post we looked at creating a pandas DataFrame and inspecting its attributes.

This post shows how to select data from a DataFrame

- (1) by row

- (2) by column

(DataFrame objects indexing and selection by rows or columns).

First, let's import the required libraries and build a simple DataFrame with 5 rows and 3 columns.

In [1]: import numpy as np

...: import pandas as pd

...: from pandas import DataFrame

...:

...: ##-- Making DataFrame

...: df_2 = DataFrame({'class_1': ['a', 'a', 'b', 'b', 'c'],

...: 'var_1': np.arange(5),

...: 'var_2': np.random.randn(5)},

...: index = ['r0', 'r1', 'r2', 'r3', 'r4'])

...:

...: df_2

Out[1]:

   class_1 var_1     var_2
r0       a      0 2.896618
r1       a      1 -0.113472
r2       b      2 0.261695
r3       b      3 -0.260788
r4       c      4 -0.791744

(1) Selecting by row (indexing and selection by row)

Let's check the index of the DataFrame.

In [2]: df_2.index # returning index

Out[2]: Index(['r0', 'r1', 'r2', 'r3', 'r4'], dtype='object')

When this post was written, 'ix' accepted both integer positions and row labels for row indexing. The .ix indexer was later deprecated and then removed in pandas 1.0, so the examples below use .iloc for integer positions and .loc for row labels.

Here are a few examples, each with a slightly different condition. Compare them closely and the usage should be easy to follow.

In [4]: df_2.iloc[2:] # indexing from int. position to end

Out[4]:

   class_1 var_1     var_2
r2       b      2 0.261695
r3       b      3 -0.260788
r4       c      4 -0.791744

In [5]: df_2.iloc[2] # indexing specific row with int. position

Out[5]:

class_1           b
var_1             2
var_2      0.261695
Name: r2, dtype: object

In [6]: df_2.loc['r2'] # indexing specific row with row label

Out[6]:

class_1           b
var_1             2
var_2      0.261695
Name: r2, dtype: object
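
On current pandas versions, position-based and label-based selection are handled by two separate indexers, .iloc and .loc. The sketch below contrasts them on a two-column version of the example data (by_position, by_label, and mixed are illustrative names) and shows one way to combine a positional row selection with a column label.

```python
import numpy as np
import pandas as pd

df_2 = pd.DataFrame({'class_1': ['a', 'a', 'b', 'b', 'c'],
                     'var_1': np.arange(5)},
                    index=['r0', 'r1', 'r2', 'r3', 'r4'])

# .iloc: integer positions, end position excluded as in normal Python slicing
by_position = df_2.iloc[2:]

# .loc: labels, and a label slice INCLUDES the end label
by_label = df_2.loc['r2':'r3']

# mixing styles: translate positions into labels first, then use .loc
mixed = df_2.loc[df_2.index[2:], 'var_1']

print(by_position)
print(by_label)
print(mixed)
```

The inclusive end label in .loc slices is the detail that most often surprises people coming from plain Python slicing.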

When the data has a very large number of rows and you only want to see the first n rows, use the head(n) method.

In [7]: df_2.head(2) # Returns first n rows

Out[7]:

   class_1 var_1     var_2
r0       a      0 2.896618
r1       a      1 -0.113472

The tail(n) method returns the last n rows.

In [8]: df_2.tail(2) # Returns last n rows

Out[8]:

   class_1 var_1     var_2
r3       b      3 -0.260788
r4       c      4 -0.791744

(2) Selecting by column (indexing and selection by column)

Let's check the columns of the df_2 DataFrame with .columns.

In [12]: df_2.columns
Out[12]: Index(['class_1', 'var_1', 'var_2'], dtype='object')

To index by column, put the column label inside '[ ]' as a string.

In [13]: df_2['class_1']

Out[13]:

r0    a
r1    a
r2    b
r3    b
r4    c
Name: class_1, dtype: object

To get two or more columns, pass a list of the column names inside the brackets (note the double brackets below).

In [14]: df_2[['class_1', 'var_1']]

Out[14]:

   class_1 var_1
r0       a      0
r1       a      1
r2       b      2
r3       b      3
r4       c      4
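
The single-bracket and double-bracket forms also differ in what they return, which matters when the result is passed on to other functions. A small sketch with made-up data (df, as_series, and as_frame are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'class_1': ['a', 'b'], 'var_1': [0, 1]})

# a single label returns a one-dimensional Series
as_series = df['var_1']
print(type(as_series).__name__)

# a list of labels returns a DataFrame, even when the list has one item
as_frame = df[['var_1']]
print(type(as_frame).__name__)
```

Use the list form when a two-dimensional object is needed and the plain label when a Series is enough.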

That concludes DataFrame indexing and selection.

The next post covers reindexing a DataFrame's index.

Posted by Rfriend

[Python pandas] Creating a pd.DataFrame and inspecting its attributes

Python Analysis and Programming / Python Data Preprocessing 2016. 11. 26. 23:51

This post covers the DataFrame, the most important data structure in Python pandas:

- (1) creating a DataFrame, and

- (2) inspecting its various attributes.

First, let's import the required libraries.

In [1]: import numpy as np

...: import pandas as pd

...: from pandas import DataFrame as df

(1) Creating a pandas DataFrame

pd.DataFrame() takes five parameters: (1) data, (2) index, (3) columns, (4) dtype, and (5) copy.

(1-1) data : the data source, such as a numpy ndarray, dict, or DataFrame

(1-2) index : row labels; if omitted, np.arange(n) is assigned automatically

(1-3) columns : column labels; if omitted, np.arange(n) is likewise assigned automatically

(1-4) dtype : data type; if not specified, pandas infers it automatically

(1-5) copy : whether to copy the input data. The default was False when this post was written. (Unless you need a copy, the default is fine from a memory-management standpoint.)

Let's create a simple DataFrame with 3 rows and 4 columns. The data argument must be supplied; index, columns, dtype, and copy can be omitted, in which case the default settings are applied when the DataFrame is created.

In [2]: df_1 = df(data=np.arange(12).reshape(3, 4),

...: index=['r0', 'r1', 'r2'], # Will default to np.arange(n) if no indexing

...: columns=['c0', 'c1', 'c2', 'c3'],

...: dtype='int', # Data type to force, otherwise infer

...: copy=False) # Copy data from inputs

In [3]: df_1

Out[3]:
    c0 c1 c2 c3
r0   0   1   2   3
r1   4   5   6   7
r2   8   9 10 11

(2) Inspecting DataFrame attributes

Next, here is how to inspect the attributes of a DataFrame.

Note that the attributes below are written without trailing parentheses (), so be careful not to confuse them with methods.

(2-1) T : transpose rows and columns

In [5]: df_1.T # Transpose index and columns

Out[5]:

    r0  r1  r2
c0   0   4   8
c1   1   5   9
c2   2   6  10
c3   3   7  11

(2-2) axes : returns the row and column labels as a list

In [6]: df_1.axes

Out[6]:

[Index(['r0', 'r1', 'r2'], dtype='object'),

Index(['c0', 'c1', 'c2', 'c3'], dtype='object')]

(2-3) dtypes : returns the data types

In [7]: df_1.dtypes # Return the dtypes in this object

Out[7]:

c0    int32
c1    int32
c2    int32
c3    int32
dtype: object

(2-4) shape : returns the number of rows and columns (the dimensions) as a tuple

In [22]: df_1.shape # Return a tuple representing the dimensionality of the DataFrame

Out[22]: (3, 4)

(2-5) size : returns the number of elements in the NDFrame

In [23]: df_1.size # number of elements in the NDFrame

Out[23]: 12

(2-6) values : returns the elements of the NDFrame as a numpy array

In [24]: df_1.values # Numpy representation of NDFrame

Out[24]:

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

The next post covers how to index a DataFrame.

Posted by Rfriend

R, Python 분석과 프로그래밍의 친구 (by R Friend)
