Posts tagged '파이썬' (Python), page 4

[Python pandas] Indexing, slicing, and selecting time series data in a Series or DataFrame

Python Analysis and Programming / Python Data Preprocessing 2019. 12. 23. 18:41

The previous post covered how to convert date-time (time series) objects to strings and how to convert strings back to date-time objects (https://rfriend.tistory.com/498).

This post shows how to index, slice, select, and truncate specific dates and times in a Python pandas Series or DataFrame whose index is date-time time series data.

(1) Indexing, slicing, selecting, and truncating time series data in a pandas Series

(2) Indexing, slicing, selecting, and truncating time series data in a pandas DataFrame

(1) Indexing, slicing, selecting, and truncating time series data in a pandas Series

First, as a simple example, let's create a pandas Series whose index holds the 10 daily dates from November 25, 2019 through December 4, 2019.

The dates are generated with pandas.date_range(start date, periods=number of date-times to generate), and the result is used as the index of a pandas Series.

import pandas as pd

from datetime import datetime

# DatetimeIndex

ts_days_idx = pd.date_range('2019-11-25', periods=10)

ts_days_idx

[Out]:
DatetimeIndex(['2019-11-25', '2019-11-26', '2019-11-27', '2019-11-28',
               '2019-11-29', '2019-11-30', '2019-12-01', '2019-12-02',
               '2019-12-03', '2019-12-04'],
              dtype='datetime64[ns]', freq='D')

# Series with time series index

series_ts = pd.Series(range(len(ts_days_idx)),
                      index=ts_days_idx)

series_ts

[Out]:

2019-11-25    0
2019-11-26    1
2019-11-27    2
2019-11-28    3
2019-11-29    4
2019-11-30    5
2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

series_ts.index

[Out]:

DatetimeIndex(['2019-11-25', '2019-11-26', '2019-11-27', '2019-11-28',
               '2019-11-29', '2019-11-30', '2019-12-01', '2019-12-02',
               '2019-12-03', '2019-12-04'],
              dtype='datetime64[ns]', freq='D')

series_ts.index[6]

[Out]: Timestamp('2019-12-01 00:00:00', freq='D')

For reference, explicitly specifying the start and end date-times with pd.date_range(start='start date-time', end='end date-time'), as in the example below, gives the same result as the periods-based example above.

import pandas as pd

pd.date_range(start='2019-11-25', end='2019-12-04')

[Out]:

DatetimeIndex(['2019-11-25', '2019-11-26', '2019-11-27', '2019-11-28',
               '2019-11-29', '2019-11-30', '2019-12-01', '2019-12-02',
               '2019-12-03', '2019-12-04'],
              dtype='datetime64[ns]', freq='D')

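One more note: in pandas.date_range('start date-time', periods=number of date-times to generate, freq='frequency unit'), the freq option sets the generation interval, for example 'S' for 1 second, '10S' for 10 seconds, 'H' for 1 hour, 'D' for 1 day, 'M' for 1 month (month end), and 'Y' for 1 year (year end). This is very convenient. (These are the aliases as of the pandas version used in this post; pandas 2.2 and later deprecate 'S', 'H', 'M', and 'Y' in favor of 's', 'h', 'ME', and 'YE'.)

< Example: generating 10 date-times at 1-second intervals >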

# 10 timeseries data points by Second(freq='S')

pd.date_range('2019-11-25 00:00:00', periods=10, freq='S')

[Out]:

DatetimeIndex(['2019-11-25 00:00:00', '2019-11-25 00:00:01',
               '2019-11-25 00:00:02', '2019-11-25 00:00:03',
               '2019-11-25 00:00:04', '2019-11-25 00:00:05',
               '2019-11-25 00:00:06', '2019-11-25 00:00:07',
               '2019-11-25 00:00:08', '2019-11-25 00:00:09'],
              dtype='datetime64[ns]', freq='S')

< Example: generating 10 date-times at 10-second intervals >

# 10 timeseries data points by 10 Seconds (freq='10S')

pd.date_range('2019-11-25 00:00:00', periods=10, freq='10S')

[Out]:
DatetimeIndex(['2019-11-25 00:00:00', '2019-11-25 00:00:10',
               '2019-11-25 00:00:20', '2019-11-25 00:00:30',
               '2019-11-25 00:00:40', '2019-11-25 00:00:50',
               '2019-11-25 00:01:00', '2019-11-25 00:01:10',
               '2019-11-25 00:01:20', '2019-11-25 00:01:30'],
              dtype='datetime64[ns]', freq='10S')
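Because the frequency strings are version-dependent, one alternative is to pass offset objects to freq instead. The following is a minimal sketch, assuming pd.offsets.Hour and pd.offsets.MonthEnd as stand-ins for the hourly and month-end strings; the variable names are illustrative and not part of the original post.

```python
import pandas as pd

# Offset objects play the same role as the freq strings
hourly = pd.date_range('2019-11-25 00:00:00', periods=3, freq=pd.offsets.Hour())
month_end = pd.date_range('2019-11-25', periods=3, freq=pd.offsets.MonthEnd())

print(hourly)     # three timestamps, one hour apart
print(month_end)  # the next three month-end dates on or after the start date
```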

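(1-1) Indexing a specific date-time in a pandas Series with a time series index

First, from the chronologically sorted Series series_ts created above, let's index the value 6 at '2019-12-01', which sits at the 7th position.

As in (a) and (b), you can index by position. Recent pandas 2.x versions emit a FutureWarning for the plain-bracket form in (a), so iloc as in (b) is the safer choice for positional access.

As in (c) and (d), you can also index with a date-time string.

As in (e), you can index with a datetime.datetime(year, month, day) object as well.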

import pandas as pd

from datetime import datetime

# (a) indexing with index number

series_ts[6]

[Out]: 6

# (b) indexing with index number using iloc

series_ts.iloc[6]

[Out]: 6

# (c) indexing with string ['year-month-day']

series_ts['2019-12-01']

[Out]: 6

# (d) indexing with string ['month/day/year']

series_ts['12/01/2019']

[Out]: 6

# (e) indexing with datetime.datetime(year, month, day)

series_ts[datetime(2019, 12, 1)]

[Out]: 6
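The same lookups can also be written with explicit accessors, which makes clear whether a label or a position is meant. This is a small sketch that rebuilds series_ts so it runs on its own; using pd.Timestamp as the label is my assumption, not something from the original post.

```python
import pandas as pd

series_ts = pd.Series(range(10), index=pd.date_range('2019-11-25', periods=10))

by_timestamp = series_ts.loc[pd.Timestamp('2019-12-01')]  # label lookup with a Timestamp
by_string = series_ts.loc['2019-12-01']                   # label lookup with a date string
by_position = series_ts.iloc[6]                           # positional lookup (7th element)

print(by_timestamp, by_string, by_position)
```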

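(1-2) Slicing date-time data in a pandas Series with a time series index

Below are five ways to slice all values from '2019-12-01' onward.

(a) and (b) slice the date-indexed Series with position:position.

(c) and (d) slice with 'year-month-day':'year-month-day' or 'month/day/year':'month/day/year' strings. The end label '2019-12-10' lies beyond the last index value, which is fine: label slicing on a sorted DatetimeIndex returns whatever falls inside the range.

(e) slices with datetime.datetime(year, month, day):datetime.datetime(year, month, day).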

import pandas as pd

from datetime import datetime

# (a) slicing with position

series_ts[6:]

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

# (b) slicing with position using iloc

series_ts.iloc[6:]

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

# (c) slicing with string

series_ts['2019-12-01':'2019-12-10']

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

# (d) slicing with string

series_ts['12/01/2019':'12/10/2019']

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

# (e) slicing with datetime

series_ts[datetime(2019, 12, 1):datetime(2019, 12, 10)]

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64
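One detail worth checking when mixing these styles is the end point: label-based slices include the end label, while positional slices exclude the end position. The sketch below rebuilds series_ts so it is self-contained; the variable names are illustrative.

```python
import pandas as pd

series_ts = pd.Series(range(10), index=pd.date_range('2019-11-25', periods=10))

# Label slice: both '2019-11-30' and '2019-12-02' are included
by_label = series_ts.loc['2019-11-30':'2019-12-02']

# Positional slice: position 7 is excluded, as with a Python list
by_position = series_ts.iloc[5:7]

print(len(by_label), len(by_position))
```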

(1-3) Selecting date-time data in a pandas Series with a time series index

With a date-time string you can also select all data for a given year or month. It is a handy and fun feature.

< Example: selecting all data for the year 2019 >

# selection with year string

series_ts['2019']

[Out]:

2019-11-25    0
2019-11-26    1
2019-11-27    2
2019-11-28    3
2019-11-29    4
2019-11-30    5
2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

< Example: selecting all data for December 2019 >

# selection with year-month string

series_ts['2019-12']

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64
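For comparison, the same month can be selected with a boolean mask on the index. This is a sketch of an alternative, not the method used in this post; it rebuilds series_ts so it runs on its own.

```python
import pandas as pd

series_ts = pd.Series(range(10), index=pd.date_range('2019-11-25', periods=10))

# Partial-string selection, written with .loc
december_by_string = series_ts.loc['2019-12']

# Equivalent boolean mask built from the DatetimeIndex
december_by_mask = series_ts[series_ts.index.month == 12]

print(december_by_string.equals(december_by_mask))
```

The mask form is more verbose, but it extends naturally to conditions a single string cannot express, such as selecting particular weekdays.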

(1-4) Truncating date-time data in a pandas Series with a time series index

The truncate() method cuts off part of the data. The before and after options set the range to cut; pay close attention to whether the given date itself is kept.

< Example: truncating all data before December 1, 2019 >

(All data up to November 30, 2019 is removed; the '2019-12-01' row remains.)

# truncate before

series_ts.truncate(before='2019-12-01')

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

< Example: truncating all data after November 30, 2019 >

(All data from December 1, 2019 onward is removed; the '2019-11-30' row remains.)

# truncate after

series_ts.truncate(after='2019-11-30')

[Out]:

2019-11-25    0
2019-11-26    1
2019-11-27    2
2019-11-28    3
2019-11-29    4
2019-11-30    5
Freq: D, dtype: int64
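before and after can also be combined in one call, and on a sorted index the result matches an inclusive label slice. A minimal sketch, rebuilding series_ts so it is self-contained:

```python
import pandas as pd

series_ts = pd.Series(range(10), index=pd.date_range('2019-11-25', periods=10))

# Keep only 2019-11-28 through 2019-12-01 (both end dates are kept)
middle = series_ts.truncate(before='2019-11-28', after='2019-12-01')

# The same range expressed as a label slice
same_by_slice = series_ts.loc['2019-11-28':'2019-12-01']

print(middle)
```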

(2) Indexing, slicing, selecting, and truncating time series data in a pandas DataFrame

The indexing, slicing, selection, and truncation methods shown for a pandas Series in (1) above can be applied to a pandas DataFrame in the same way.

Let's create a simple pandas DataFrame whose index holds year-month-day dates.

import pandas as pd

from datetime import datetime

# DatetimeIndex

ts_days_idx = pd.date_range('2019-11-25', periods=10)

ts_days_idx

[Out]:

DatetimeIndex(['2019-11-25', '2019-11-26', '2019-11-27', '2019-11-28',
               '2019-11-29', '2019-11-30', '2019-12-01', '2019-12-02',
               '2019-12-03', '2019-12-04'],
              dtype='datetime64[ns]', freq='D')

# DataFrame with DatetimeIndex

df_ts = pd.DataFrame(range(len(ts_days_idx)),
                     columns=['col'],
                     index=ts_days_idx)

df_ts

	col
2019-11-25	0
2019-11-26	1
2019-11-27	2
2019-11-28	3
2019-11-29	4
2019-11-30	5
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

(2-1) Indexing a specific date-time in a pandas DataFrame with a time series index

This is almost the same as the Series indexing in (1-1), except that on a DataFrame the two forms df_ts[6] and df_ts[datetime(2019, 12, 1)] raise a KeyError, because plain brackets on a DataFrame look up columns. Only the following three forms can be used for indexing.

(a) Index by position with iloc[integer]

(b), (c) Index by label with loc['label']

# (a) indexing with index position integer using iloc[]

df_ts.iloc[6]

[Out]:

col    6
Name: 2019-12-01 00:00:00, dtype: int64

# (b) indexing with index labels ['year-month-day']

df_ts.loc['2019-12-01']

[Out]:

col    6
Name: 2019-12-01 00:00:00, dtype: int64

# (c) indexing with index labels ['month/day/year']

df_ts.loc['12/01/2019']

[Out]:

col    6
Name: 2019-12-01 00:00:00, dtype: int64
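The KeyError behavior described in this section can be checked directly. The sketch below rebuilds df_ts and records which plain-bracket keys fail; failed_keys and row are illustrative names, not from the original post.

```python
from datetime import datetime

import pandas as pd

df_ts = pd.DataFrame({'col': range(10)},
                     index=pd.date_range('2019-11-25', periods=10))

# Plain brackets on a DataFrame look for a column with that name
failed_keys = []
for key in (6, datetime(2019, 12, 1)):
    try:
        df_ts[key]
    except KeyError:
        failed_keys.append(key)

# .loc looks the key up in the row index instead
row = df_ts.loc[datetime(2019, 12, 1)]

print(failed_keys)
print(row)
```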

(2-2) Slicing date-time data in a pandas DataFrame with a time series index

Below are four ways to slice all values from '2019-12-01' onward.

(a) slices the date-indexed DataFrame with position:position.

(b) and (c) slice with date strings inside loc, as loc['year-month-day':'year-month-day'] or loc['month/day/year':'month/day/year'].

(d) slices with loc[datetime.datetime(year, month, day):datetime.datetime(year, month, day)].

# (a) slicing DataFrame with position integer

df_ts[6:10]

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

# (b) slicing using date strings 'year-month-day'

df_ts.loc['2019-12-01':'2019-12-10']

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

# (c) slicing using date strings 'month/day/year'

df_ts.loc['12/01/2019':'12/10/2019']

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

# (d) slicing using datetime objects

from datetime import datetime

df_ts.loc[datetime(2019, 12, 1):datetime(2019, 12, 10)]

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

(2-3) Selecting date-time data in a pandas DataFrame with a time series index

Passing a 'year' or 'year-month' date string to df.loc['year'] or df.loc['year-month'] selects all data for that year or month.

< Example: selecting all data for 2019 >

# selection of year '2019'

df_ts.loc['2019'] # df_ts['2019'] worked only in older pandas (row selection via [] was removed in pandas 2.0)

	col
2019-11-25	0
2019-11-26	1
2019-11-27	2
2019-11-28	3
2019-11-29	4
2019-11-30	5
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

< Example: selecting all data for December 2019 >

# selection of year-month '2019-12'

df_ts.loc['2019-12']

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9
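Since loc accepts a row selector and a column selector together, the partial date string can be combined with a column name. A small sketch, rebuilding df_ts so it is self-contained; december_col is an illustrative name.

```python
import pandas as pd

df_ts = pd.DataFrame({'col': range(10)},
                     index=pd.date_range('2019-11-25', periods=10))

# Rows for December 2019, restricted to the 'col' column (returns a Series)
december_col = df_ts.loc['2019-12', 'col']

print(december_col)
```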

(2-4) Truncating date-time data in a pandas DataFrame with a time series index

The truncate() method cuts off the data before the before date or after the after date.

< Example: truncating the data before '2019-12-01' >

(The '2019-12-01' row is not removed.)

# truncate before

df_ts.truncate(before='2019-12-01') # '2019-12-01' is not removed

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

< Example: truncating the data after '2019-11-30' >

(The '2019-11-30' row is not removed.)

# truncate after

df_ts.truncate(after='2019-11-30') # '2019-11-30' is not removed

	col
2019-11-25	0
2019-11-26	1
2019-11-27	2
2019-11-28	3
2019-11-29	4
2019-11-30	5

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend


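[Python] Jupyter Notebook settings: cell width, DataFrame column width, text alignment, decimal places, plot size, and maximum number of rows

Python Analysis and Programming / Python Installation and Basic Usage 2019. 10. 7. 23:24

This post collects small Jupyter Notebook tips that you do not need often, but that cost you time hunting down whenever you do need them, so they are worth noting somewhere and pulling out when required.

(1) Setting the cell width in Jupyter Notebook

(2) Setting the maximum DataFrame column width in Jupyter Notebook

(3) Left-aligning text in a DataFrame in Jupyter Notebook

(4) Setting the number of decimal places in a DataFrame in Jupyter Notebook

(5) Setting matplotlib plot defaults in Jupyter Notebook (figure size, line width, color, grid)

(6) Setting the maximum number of rows displayed in Jupyter Notebook (number of max rows)

(7) Setting the maximum number of columns displayed in Jupyter Notebook (number of max columns)

(1) Setting the cell width in Jupyter Notebook

Change the number in the width value (50% here) in the code below.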

from IPython.display import display, HTML

display(HTML("<style>.container { width: 50% !important; }</style>"))

(2) Setting the maximum DataFrame column width in Jupyter Notebook

(setting the max-width of DataFrame's column in Jupyter Notebook)

import pandas as pd

pd.set_option('display.max_colwidth', 10)

df = pd.DataFrame({'a': [100000000.0123, 20000000.54321],
                   'b': ['abcdefghijklmnop', 'qrstuvwxyz']})

df

Out[2]:

	a	b
0	1.0000...	abcdef...
1	2.0000...	qrstuv...

pd.set_option('display.max_colwidth', 50)

df

Out[3]:

	a	b
0	1.000000e+08	abcdefghijklmnop
1	2.000000e+07	qrstuvwxyz
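If the narrower width is only needed for one display, pd.option_context applies the option temporarily and restores the previous value on exit. This is a sketch of an alternative to calling set_option twice; narrow and wide are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'b': ['abcdefghijklmnop', 'qrstuvwxyz']})

width_before = pd.get_option('display.max_colwidth')

# The option only applies inside the with-block
with pd.option_context('display.max_colwidth', 10):
    narrow = repr(df)   # long strings are cut off with "..."

wide = repr(df)         # back to the previous setting

print(narrow)
print(wide)
```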

(3) Left-aligning text in a DataFrame in Jupyter Notebook

(align text of pandas DataFrame to left in Jupyter Notebook)

dfStyler = df.style.set_properties(**{'text-align': 'left'})

dfStyler.set_table_styles([dict(selector='th',
                                props=[('text-align', 'left')])])

Out[4]:

	a	b
0	1e+08	abcdefghijklmnop
1	2e+07	qrstuvwxyz

(4) Setting the number of decimal places in a DataFrame in Jupyter Notebook

(setting the decimal point format of pandas DataFrame in Jupyter Notebook)

As an example, let's switch from scientific notation to plain numeric notation and show only two decimal places.

import pandas as pd

pd.options.display.float_format = '{:.2f}'.format

df

Out[6]:

	a	b
0	100000000.01	abcdefghijklmnop
1	20000000.54	qrstuvwxyz
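The float format stays in effect for the rest of the session, so it can be useful to know how to undo it. A minimal sketch, assuming pd.set_option and pd.reset_option with the 'display.float_format' key as the equivalent of the attribute assignment above:

```python
import pandas as pd

s = pd.Series([100000000.0123])

# Same effect as assigning pd.options.display.float_format
pd.set_option('display.float_format', '{:.2f}'.format)
formatted = repr(s)

# Restore the pandas default (no custom float formatter)
pd.reset_option('display.float_format')

print(formatted)
```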

(5) Setting matplotlib plot defaults in Jupyter Notebook

(setting figure size, line width, color, grid of matplotlib plot in Jupyter Notebook)

With plt.rcParams[] from matplotlib.pyplot you can set the figure size, line width, line color, whether to draw a grid, and so on.

# matplotlib setting

import matplotlib.pyplot as plt

%matplotlib inline

plt.rcParams["figure.figsize"] = (6, 5)

plt.rcParams["lines.linewidth"] = 2

plt.rcParams["lines.color"] = 'r' # note: plt.plot() picks its colors from axes.prop_cycle, so this alone may not turn the line red

plt.rcParams["axes.grid"] = True

# simple plot

x = [1, 2, 3, 4, 5]

y = [2, 3.5, 5, 8, 9]

plt.plot(x, y)

plt.show()
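When the settings should apply to a single figure rather than the whole notebook, plt.rc_context scopes them to a with-block. The sketch below is an assumption-laden variant of the example above: it selects the non-interactive 'Agg' backend so that it also runs outside a notebook, and it sets the line color and width directly on the plot call.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 3.5, 5, 8, 9]

default_size = tuple(plt.rcParams['figure.figsize'])

# rcParams changed here are restored when the block ends
with plt.rc_context({'figure.figsize': (6, 5), 'axes.grid': True}):
    fig, ax = plt.subplots()
    ax.plot(x, y, color='r', linewidth=2)

restored_size = tuple(plt.rcParams['figure.figsize'])
```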

(6) Setting the maximum number of rows displayed in Jupyter Notebook (number of max rows)

When a DataFrame with many rows is displayed in a Jupyter Notebook cell, the default setting replaces the middle part with dots ("...") and shows only the first 5 and last 5 rows.

# if there are many rows, JN does not print all rows

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(200).reshape(100, 2))
df

Sometimes, though, you need to print every row and check it by eye. In that case you can set the maximum number of rows that Jupyter Notebook displays. In the example above, the rows after the 5th were replaced by dots and hidden; after raising the limit to 100 with pd.set_option('display.max_rows', 100), all 100 rows are printed.

# setting the number of maximum rows in Jupyter Notebook

import pandas as pd

pd.set_option('display.max_rows', 100)

df
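To print all rows just once without changing the global setting, the same option can be wrapped in pd.option_context. This is a sketch under the assumption that the default display.max_rows is still in effect; truncated and full are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(200).reshape(100, 2))

truncated = repr(df)   # default setting: middle rows are elided

# None means "no limit"; the previous value is restored after the block
with pd.option_context('display.max_rows', None):
    full = repr(df)

print(len(truncated.splitlines()), len(full.splitlines()))
```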

(7) Setting the maximum number of columns displayed in Jupyter Notebook (number of max columns)

When a DataFrame with more than 20 columns is displayed in a Jupyter Notebook cell, the default setting replaces the middle columns with dots ("...") and shows only 20 of them, so the values in the hidden middle columns cannot be checked.

In that case, pandas.set_option('display.max_columns', col_num) sets the maximum number of columns shown on screen. With it, the columns that were previously collapsed into "..." (columns 10 through 12 in the author's example) are displayed properly in Jupyter Notebook.

# setting the maximum number of columns in jupyter notebook display

import pandas as pd

pd.set_option('display.max_columns', 50) # set the maximum number whatever you want

I hope this was helpful.


[R] Using R in Jupyter Notebook

Python Analysis and Programming / Python Installation and Basic Usage 2019. 10. 6. 14:31

Some readers may think of Jupyter Notebook as a Python-only tool and RStudio as the only place for R, but once IRkernel is installed you can use R in Jupyter Notebook as well.

This post shows how to set up Jupyter Notebook so that R can be used in it.

(Environment: macOS, Anaconda, Python 3.6)

Install IRkernel from a command prompt window (terminal) by following the steps below.

(Attempting the installation from RStudio produces an error. Please run it from a command prompt / terminal.)

1. Installing IRkernel so that R can be used in Jupyter Notebook

(1) Type 'R' in the command prompt window (terminal) and press Enter to start R.

MacBook-Pro:~ ihongdon$

MacBook-Pro:~ ihongdon$ R

R version 3.6.0 (2019-04-26) -- "Planting of a Tree"

Platform: x86_64-apple-darwin15.6.0 (64-bit)

(2) Install the devtools package: install.packages('devtools')

When a combo box appears with the message '현재 세션에서 사용할 CRAN 미러를 선택해 주세요' (please select a CRAN mirror for use in this session), pick any server. I chose 'SEOUL'.

>

> install.packages('devtools')

--- 현재 세션에서 사용할 CRAN 미러를 선택해 주세요 ---

SEOUL

경고: 저장소 https://cran.seoul.go.kr/bin/macosx/el-capitan/contrib/3.6에 대한 인덱스에 접근할 수 없습니다:

URL 'https://cran.seoul.go.kr/bin/macosx/el-capitan/contrib/3.6/PACKAGES'를 열 수 없습니다

소스형태의 패키지 ‘devtools’(들)를 설치합니다.

URL 'https://cran.seoul.go.kr/src/contrib/devtools_2.2.1.tar.gz'을 시도합니다

Content type 'application/x-gzip' length 372273 bytes (363 KB)

==================================================

downloaded 363 KB

* installing *source* package ‘devtools’ ...

** 패키지 ‘devtools’는 성공적으로 압축해제되었고, MD5 sums 이 확인되었습니다

** using staged installation

** R

** inst

** byte-compile and prepare package for lazy loading

** help

*** installing help indices

*** copying figures

** building package indices

** installing vignettes

** testing if installed package can be loaded from temporary location

** testing if installed package can be loaded from final location

** testing if installed package keeps a record of temporary installation path

* DONE (devtools)

다운로드한 소스 패키지들은 다음의 위치에 있습니다

‘/private/var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T/Rtmpg1bePN/downloaded_packages’

>

(3) Install IRkernel: run devtools::install_github('IRkernel/IRkernel').

>

> devtools::install_github('IRkernel/IRkernel')

Downloading GitHub repo IRkernel/IRkernel@master

'/usr/local/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules https://github.com/jupyter/jupyter_kernel_test.git /var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T//Rtmpg1bePN/remotes36921150b621/IRkernel-IRkernel-67592db/tests/testthat/jkt

'/var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T//Rtmpg1bePN/remotes36921150b621/IRkernel-IRkernel-67592db/tests/testthat/jkt'에 복제합니다...

remote: Enumerating objects: 12, done.

remote: Counting objects: 100% (12/12), done.

remote: Compressing objects: 100% (11/11), done.

remote: Total 12 (delta 1), reused 3 (delta 0), pack-reused 0

오브젝트 묶음 푸는 중: 100% (12/12), 완료.

'/usr/local/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules https://github.com/flying-sheep/ndjson-testrunner.git /var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T//Rtmpg1bePN/remotes36921150b621/IRkernel-IRkernel-67592db/tests/testthat/njr

'/var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T//Rtmpg1bePN/remotes36921150b621/IRkernel-IRkernel-67592db/tests/testthat/njr'에 복제합니다...

remote: Enumerating objects: 10, done.

remote: Counting objects: 100% (10/10), done.

remote: Compressing objects: 100% (8/8), done.

remote: Total 10 (delta 0), reused 6 (delta 0), pack-reused 0

오브젝트 묶음 푸는 중: 100% (10/10), 완료.

Skipping 1 packages ahead of CRAN: htmltools

✔ checking for file ‘/private/var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T/Rtmpg1bePN/remotes36921150b621/IRkernel-IRkernel-67592db/DESCRIPTION’

─ preparing ‘IRkernel’:

✔ checking DESCRIPTION meta-information ...

─ checking for LF line-endings in source and make files and shell scripts

─ checking for empty or unneeded directories

Removed empty directory ‘IRkernel/example-notebooks’

─ building ‘IRkernel_1.0.2.9000.tar.gz’

* installing *source* package ‘IRkernel’ ...

** using staged installation

** R

** inst

** byte-compile and prepare package for lazy loading

** help

*** installing help indices

** building package indices

** testing if installed package can be loaded from temporary location

** testing if installed package can be loaded from final location

** testing if installed package keeps a record of temporary installation path

* DONE (IRkernel)

>

(4) Run IRkernel::installspec() and check the result

>

> IRkernel::installspec()

[InstallKernelSpec] Installed kernelspec ir in /Users/ihongdon/Library/Jupyter/kernels/ir

>

If you see a message like the one below, it is probably because steps (1), (2), and (3) were run in RStudio rather than in a command prompt window (terminal).

jupyter-client has to be installed but "jupyter kernelspec --version" exited with code 127.

If this message appears, open a command prompt window (terminal), run steps (1), (2), and (3) there, and then check again with (4).
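The registration can also be checked from Python. The sketch below assumes the jupyter_client package is importable in the active environment and that the kernel is registered under the name 'ir', as in the installspec output above; has_r_kernel is a hypothetical helper written for this illustration.

```python
def has_r_kernel(kernel_specs):
    """Return True if a kernelspec named 'ir' is among the given names."""
    return 'ir' in kernel_specs


try:
    from jupyter_client.kernelspec import KernelSpecManager

    # find_kernel_specs() returns a dict mapping kernel names to resource directories
    installed = KernelSpecManager().find_kernel_specs()
    print('R kernel registered:', has_r_kernel(installed))
except ImportError:
    print('jupyter_client is not importable from this Python environment')
```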

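If you are on Windows 10, try adding the following paths to PATH as well.

Anaconda\Lib\site-packages\jupyter_client

C:\Users\Anaconda3\Scripts

For reference, PATH is registered on Windows 10 and Windows 8 as follows.

[ Registering PATH on Windows 10 or Windows 8 ]

Select 'System (Control Panel)'

Select (click) the 'Advanced system settings' link

Select 'Environment Variables' -> in the System variables section select the 'PATH' environment variable -> select 'Edit' -> if the PATH environment variable does not exist, select 'New'

In the 'Edit System Variable (or New System Variable)' window, specify the value of the PATH environment variable -> select 'OK' -> close all remaining windows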

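2. Trying out R in Jupyter Notebook

Now that the R kernel is installed, let's use R in Jupyter Notebook. In the terminal where R is still running, press 'ctrl + z' to drop back to the shell. (This suspends R rather than closing it, as the '[1]+ Stopped R' line below shows; typing q() in R exits it completely.)

$ conda info -e (or conda env list) lists the virtual environments,

$ source activate [environment name] enters a specific virtual environment (recent conda versions use conda activate [environment name]).

(On Windows: activate [environment name])

$ jupyter notebook opens the Jupyter Notebook window.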

>

[1]+ Stopped R

MacBook-Pro:~ ihongdon$

MacBook-Pro:~ ihongdon$ conda info -e

# conda environments:

#

base * /Users/ihongdon/anaconda3

py2.7_tf1.4 /Users/ihongdon/anaconda3/envs/py2.7_tf1.4

py3.5_tf1.4 /Users/ihongdon/anaconda3/envs/py3.5_tf1.4

py3.6_tf2.0 /Users/ihongdon/anaconda3/envs/py3.6_tf2.0

MacBook-Pro:~ ihongdon$

MacBook-Pro:~ ihongdon$ source activate py3.6_tf2.0

(py3.6_tf2.0) MacBook-Pro:~ ihongdon$

(py3.6_tf2.0) MacBook-Pro:~ ihongdon$ jupyter notebook

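When the Jupyter Notebook window opens, click the 'New' menu at the top right and choose 'R' from the submenu.

As a simple example, I created an R data frame with two variables, x and y, and drew a scatter plot with ggplot2.

Next, I fitted a linear regression model of the dependent variable y on the explanatory variable x.

I hope this was helpful.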


[Python] pandas DataFrame: how to fix "ValueError: If using all scalar values, you must pass an index"

Python Analysis and Programming / Python Data Preprocessing 2019. 9. 15. 12:18

This post introduces four ways to fix the error "ValueError: If using all scalar values, you must pass an index" that can occur when creating a Python pandas DataFrame.

As in the example below, when you try to build a pandas DataFrame from a dictionary of key-value pairs and every value is a scalar, pandas cannot infer how many rows to create and raises "ValueError: If using all scalar values, you must pass an index".

import pandas as pd

df = pd.DataFrame({'col_1': 1,

'col_2': 2})

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-73d6f192ba2a> in <module>()
      1 df = pd.DataFrame({'col_1': 1, 
----> 2                   'col_2': 2})

~/anaconda3/envs/py3.5_tf1.4/lib/python3.5/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    273                                  dtype=dtype, copy=copy)
    274         elif isinstance(data, dict):
--> 275             mgr = self._init_dict(data, index, columns, dtype=dtype)
    276         elif isinstance(data, ma.MaskedArray):
    277             import numpy.ma.mrecords as mrecords

~/anaconda3/envs/py3.5_tf1.4/lib/python3.5/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
    409             arrays = [data[k] for k in keys]
    410 
--> 411         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    412 
    413     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

~/anaconda3/envs/py3.5_tf1.4/lib/python3.5/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5494     # figure out the index, if necessary
   5495     if index is None:
-> 5496         index = extract_index(arrays)
   5497     else:
   5498         index = _ensure_index(index)

~/anaconda3/envs/py3.5_tf1.4/lib/python3.5/site-packages/pandas/core/frame.py in extract_index(data)
   5533 
   5534         if not indexes and not raw_lengths:
-> 5535             raise ValueError('If using all scalar values, you must pass'
   5536                              ' an index')
   5537 

ValueError: If using all scalar values, you must pass an index

The four ways to fix this error are introduced in turn below.

(1) Fix 1: pass an index

Following the hint in the error message, "you must pass an index", simply supply an index value.

# (1) pass an index

df = pd.DataFrame({'col_1': 1,

'col_2': 2},

index = [0])

df

	col_1	col_2
0	1	2

You can of course set the index to any value you like. Let's try 'row_1' as the index.

df = pd.DataFrame({'col_1': 1,

'col_2': 2},

index = ['row_1'])

df

	col_1	col_2
row_1	1	2

(2) Fix 2: use a list instead of scalar values

If you wrap each value in square brackets [ ] so that the dictionary values are lists, the error does not occur.

# (2) use a list instead of scalar values

df2 = pd.DataFrame({'col_1': [1],

'col_2': [2]})

df2

	col_1	col_2
0	1	2

(3) Fix 3: create the DataFrame with pd.DataFrame.from_records([{'key': value}])

Here too the dictionary must be wrapped in [ ] and passed as a list. (Leaving out the [ ] raises the same error.)

# (3) use pd.DataFrame.from_records() with a list

df3 = pd.DataFrame.from_records([{'col_1': 1,

'col_2': 2}])

df3

	col_1	col_2
0	1	2

(4) Fix 4: create the DataFrame with pd.DataFrame.from_dict([{'key': value}])

This is almost the same as (3), but uses from_dict([]) instead of from_records([]). Again, wrap the dictionary in [ ] and pass it as a list.

# (4) use pd.DataFrame.from_dict([]) with a list

df4 = pd.DataFrame.from_dict([{'col_1': 1,

'col_2': 2}])

df4

	col_1	col_2
0	1	2
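As a quick check, the four fixes can be run side by side to confirm that they build the same one-row table. The sketch below is only an illustration; data, frames, and all_same are names introduced here.

```python
import pandas as pd

data = {'col_1': 1, 'col_2': 2}

# The original call fails because every value is a scalar
try:
    pd.DataFrame(data)
    raised = False
except ValueError:
    raised = True

frames = [
    pd.DataFrame(data, index=[0]),                                # (1) pass an index
    pd.DataFrame({key: [value] for key, value in data.items()}),  # (2) lists instead of scalars
    pd.DataFrame.from_records([data]),                            # (3) from_records with a list
    pd.DataFrame.from_dict([data]),                               # (4) from_dict with a list
]

# Every variant should hold the single row {'col_1': 1, 'col_2': 2}
all_same = all(frame.to_dict('records') == [data] for frame in frames)

print(raised, all_same)
```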

I hope this was helpful.


[Python] Reading JSON data from the web and turning it into a pandas DataFrame

Python Analysis and Programming / Python Data Preprocessing 2019. 8. 31. 20:22

The previous post covered writing JSON data with Python's json.dumps() and reading JSON data into Python with json.loads() (https://rfriend.tistory.com/474).

Continuing from there, this post shows how to read JSON format data from a web API into Python and convert it to a pandas DataFrame.

As a source of JSON files, I will use "Awesome JSON Datasets (https://github.com/jdorfman/awesome-json-datasets)" as the example.

Several JSON datasets are listed there. From these, let's read the 'Nobel Prize' JSON data (http://api.nobelprize.org/v1/prize.json) and turn it into a DataFrame.

(1) Reading JSON data from an API website into Python

Using the urlopen function from the urllib module, we send a request to the URL that holds the JSON data, open it, and read the JSON data. Python's json.loads() then turns it into a Python object named novel_prize_json.

# parse a JSON string using json.loads() method : returns a dictionary

import json

import urllib

import pandas as pd

# API request to the URL

import sys

if sys.version_info[0] == 3:
    from urllib.request import urlopen # for Python 3.x
else:
    from urllib import urlopen # for Python 2.x

with urlopen("http://api.nobelprize.org/v1/prize.json") as url:
    novel_prize_json_file = url.read()

When importing urlopen from the urllib module, Python 2.x uses from urllib import urlopen, whereas Python 3.x uses from urllib.request import urlopen. So if a Python 3.x user tries to import urlopen with the Python 2.x form from urllib import urlopen, as below, the error ImportError: cannot import name 'urlopen' is raised.

# ImportError: cannot import name 'urlopen' at python 3.x

from urllib import urlopen # It's only for Python 2.x. It's not working at Python 3.x

with urlopen("http://api.nobelprize.org/v1/prize.json") as url:
    novel_prize_json_file = url.read()

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-81c8fae1a1fd> in <module>()
      1 # API request to the URL
----> 2 from urllib import urlopen
      3 
      4 with urlopen("http://api.nobelprize.org/v1/prize.json") as url:
      5     novel_prize_json_file = url.read()

ImportError: cannot import name 'urlopen'

Next, let's decode the JSON data read above with Python's json.loads() method. The bytes are first decoded to a string with decode('utf-8').

# decoding to python object

novel_prize_json = json.loads(novel_prize_json_file.decode('utf-8'))

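Note that decode must actually be called. Passing the method itself without parentheses (novel_prize_json_file.decode), as below, raises a TypeError because json.loads() receives a method object instead of a string. (TypeError: the JSON object must be str, not 'builtin_function_or_method')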

# decoding TypeError. decode using decode('utf-8')

novel_prize_json = json.loads(novel_prize_json_file.decode)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-1fda68be6386> in <module>()
      1 # decoding TypeError
----> 2 novel_prize_json = json.loads(novel_prize_json_file.decode)

C:\Users\admin\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    310     if not isinstance(s, str):
    311         raise TypeError('the JSON object must be str, not {!r}'.format(
--> 312                             s.__class__.__name__))
    313     if s.startswith(u'\ufeff'):
    314         raise JSONDecodeError("Unexpected UTF-8 BOM (decode using utf-8-sig)",

TypeError: the JSON object must be str, not 'builtin_function_or_method'
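The difference between calling decode and merely referencing it can be reproduced without a network request. The sketch below uses a tiny hand-written byte string as a stand-in for the downloaded file; raw_bytes and its contents are made up for illustration and are not the real API response.

```python
import json

# Stand-in for the bytes returned by url.read()
raw_bytes = '{"prizes": [{"year": "2018", "category": "physics"}]}'.encode('utf-8')

# Calling decode() yields a str that json.loads() can parse
parsed = json.loads(raw_bytes.decode('utf-8'))

# Passing the method object itself (no parentheses) raises TypeError
try:
    json.loads(raw_bytes.decode)
    type_error_raised = False
except TypeError:
    type_error_raised = True

print(parsed['prizes'][0]['category'], type_error_raised)
```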

Now that the Nobel Prize JSON file has been read into a Python object, let's check its keys with the keys() method. Then let's print the first item under the 'prizes' key (novel_prize_json['prizes'][0]), which holds the Physics Nobel laureates.

novel_prize_json.keys()

dict_keys(['prizes'])

novel_prize_json['prizes'][0].keys()

dict_keys(['laureates', 'year', 'overallMotivation', 'category'])

novel_prize_json['prizes'][0]

{'category': 'physics',
 'laureates': [{'firstname': 'Arthur',
   'id': '960',
   'motivation': '"for the optical tweezers and their application to biological systems"',
   'share': '2',
   'surname': 'Ashkin'},
  {'firstname': 'Gérard',
   'id': '961',
   'motivation': '"for their method of generating high-intensity, ultra-short optical pulses"',
   'share': '4',
   'surname': 'Mourou'},
  {'firstname': 'Donna',
   'id': '962',
   'motivation': '"for their method of generating high-intensity, ultra-short optical pulses"',
   'share': '4',
   'surname': 'Strickland'}],
 'overallMotivation': '"for groundbreaking inventions in the field of laser physics"',
 'year': '2018'}

To make it more readable, let's use json.dumps(obj, indent=4) to indent the output by 4 spaces.

print(json.dumps(novel_prize_json['prizes'][0], indent=4))

{
    "laureates": [
        {
            "share": "2",
            "id": "960",
            "surname": "Ashkin",
            "motivation": "\"for the optical tweezers and their application to biological systems\"",
            "firstname": "Arthur"
        },
        {
            "share": "4",
            "id": "961",
            "surname": "Mourou",
            "motivation": "\"for their method of generating high-intensity, ultra-short optical pulses\"",
            "firstname": "G\u00e9rard"
        },
        {
            "share": "4",
            "id": "962",
            "surname": "Strickland",
            "motivation": "\"for their method of generating high-intensity, ultra-short optical pulses\"",
            "firstname": "Donna"
        }
    ],
    "year": "2018",
    "overallMotivation": "\"for groundbreaking inventions in the field of laser physics\"",
    "category": "physics"
}
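In the output above, the name Gérard is printed as G\u00e9rard because json.dumps() escapes non-ASCII characters by default. A minimal sketch of keeping such characters readable with ensure_ascii=False follows; the laureate dictionary is a small hypothetical stand-in for one record of the Nobel Prize data, not the full object.

```python
import json

# Hypothetical stand-in for one laureate record from the Nobel Prize data
laureate = {"firstname": "Gérard", "surname": "Mourou"}

escaped = json.dumps(laureate)                       # default: ensure_ascii=True
readable = json.dumps(laureate, ensure_ascii=False)  # keep non-ASCII characters

print(escaped)
print(readable)

# Both strings decode back to the same Python object
same = json.loads(escaped) == json.loads(readable)
print(same)
```

Either form is valid JSON; the choice only affects how the text looks to a human reader.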

Let's print it once more, this time also sorting by keys.

print(json.dumps(novel_prize_json['prizes'][0], indent=4, sort_keys=True))

{
    "category": "physics",
    "laureates": [
        {
            "firstname": "Arthur",
            "id": "960",
            "motivation": "\"for the optical tweezers and their application to biological systems\"",
            "share": "2",
            "surname": "Ashkin"
        },
        {
            "firstname": "G\u00e9rard",
            "id": "961",
            "motivation": "\"for their method of generating high-intensity, ultra-short optical pulses\"",
            "share": "4",
            "surname": "Mourou"
        },
        {
            "firstname": "Donna",
            "id": "962",
            "motivation": "\"for their method of generating high-intensity, ultra-short optical pulses\"",
            "share": "4",
            "surname": "Strickland"
        }
    ],
    "overallMotivation": "\"for groundbreaking inventions in the field of laser physics\"",
    "year": "2018"
}

(2) Turning JSON data into a pandas DataFrame

Next, let's index into part of the JSON data loaded into Python and turn it into a pandas DataFrame.

For example, ['prizes'][0] is the Nobel Prize in physics, so ['prizes'][0]['laureates'] selects only the physics laureates, and pd.DataFrame() turns that list into a DataFrame.

novel_prize_physics = pd.DataFrame(novel_prize_json['prizes'][0]["laureates"])

novel_prize_physics

	firstname	id	motivation	share	surname
0	Arthur	960	"for the optical tweezers and their applicatio...	2	Ashkin
1	Gérard	961	"for their method of generating high-intensity...	4	Mourou
2	Donna	962	"for their method of generating high-intensity...	4	Strickland

You can set the column order with the columns argument when creating the DataFrame.

novel_prize_physics = pd.DataFrame(novel_prize_json['prizes'][0]["laureates"],
                                   columns=['id', 'firstname', 'surname', 'share', 'motivation'])

novel_prize_physics

	id	firstname	surname	share	motivation
0	960	Arthur	Ashkin	2	"for the optical tweezers and their applicatio...
1	961	Gérard	Mourou	4	"for their method of generating high-intensity...
2	962	Donna	Strickland	4	"for their method of generating high-intensity...
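The example above builds a DataFrame from the laureates of a single prize. To get one table covering several prizes at once, pd.json_normalize() can flatten the nested list. This is a sketch under assumptions: the prizes list below is a small hypothetical sample shaped like novel_prize_json['prizes'], and the top-level pd.json_normalize() function assumes pandas 1.0 or later.

```python
import pandas as pd

# Small hypothetical sample shaped like novel_prize_json['prizes']
prizes = [
    {"year": "2018", "category": "physics",
     "laureates": [
         {"firstname": "Arthur", "surname": "Ashkin"},
         {"firstname": "Gérard", "surname": "Mourou"},
     ]},
    {"year": "2018", "category": "chemistry",
     "laureates": [
         {"firstname": "Frances H.", "surname": "Arnold"},
     ]},
]

# One row per laureate; 'year' and 'category' are copied from the parent prize
laureates_df = pd.json_normalize(prizes, record_path="laureates",
                                 meta=["year", "category"])
print(laureates_df)
```

Entries that lack a 'laureates' key would need to be filtered out first, since record_path expects the key to be present in every record.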

I hope this was helpful.

The next post covers how to read XML data from the web into Python and turn it into a pandas DataFrame.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend


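[Python] Reading and Writing JSON Data with Python

Python Analysis and Programming/Python Data Preprocessing 2019. 8. 31. 16:41

According to the introduction on json.org, JSON (JavaScript Object Notation) is a lightweight text format for storing and exchanging data, alongside XML and YAML. JSON is easy for humans to read and write, and easy for machines to parse and generate. As the name suggests, it is based on a subset of the JavaScript programming language. It is a language-independent text format, but it uses conventions familiar to programmers of the C family of languages (C, C++, C#, Java, JavaScript, Perl, Python, and others), which makes it well suited to exchanging data between programs written in different languages.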

JSON is built on the following two structures.

A collection of name/value pairs: object, record, struct, dictionary, hash table, keyed list, associative array
An ordered list of values: array, vector, list, sequence

Here is an example of JSON data holding information about a student named Hong Gildong.

{
    "1.FirstName": "Gildong",
    "2.LastName": "Hong",
    "3.Age": 20,
    "4.University": "Yonsei University",
    "5.Courses": [
        {
            "Classes": [
                "Probability",
                "Generalized Linear Model",
                "Categorical Data Analysis"
            ],
            "Major": "Statistics"
        },
        {
            "Classes": [
                "Data Structure",
                "Programming",
                "Algorithms"
            ],
            "Minor": "ComputerScience"
        }
    ]
}

In this post, I will show how to

(1) write a Python object to JSON (serialization, encoding)

(2) read JSON data into a Python object (deserialization, decoding)

As the conversion table between Python and JSON types shows, Python's list and tuple are both converted to a JSON array, while a JSON array is converted to a Python list. So if you convert a Python tuple to JSON and then convert it back to Python, you get a list, not a tuple. Keep this in mind when using it.
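The tuple-to-list behavior described above can be checked with a short round trip. This is a minimal sketch; the point dictionary and its coordinate values are made up for illustration.

```python
import json

point = {"coords": (37.5, 127.0)}   # the value is a Python tuple

encoded = json.dumps(point)          # the tuple is written as a JSON array
decoded = json.loads(encoded)        # the JSON array comes back as a list

print(encoded)                       # {"coords": [37.5, 127.0]}
print(type(decoded["coords"]))       # <class 'list'>

# Convert back explicitly if the tuple type matters downstream
restored = tuple(decoded["coords"])
print(restored == point["coords"])   # True
```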

(1) Writing a Python object to JSON (serialization, encoding): json.dump(), json.dumps()

Writing a Python object as JSON data requires the json module from Python's standard library.

import json

Let's turn the following dictionary, which holds information about the student Hong Gildong, into JSON in two ways: json.dump() and json.dumps().

student_data = {
    "1.FirstName": "Gildong",
    "2.LastName": "Hong",
    "3.Age": 20,
    "4.University": "Yonsei University",
    "5.Courses": [
        {
            "Major": "Statistics",
            "Classes": ["Probability",
                        "Generalized Linear Model",
                        "Categorical Data Analysis"]
        },
        {
            "Minor": "ComputerScience",
            "Classes": ["Data Structure",
                        "Programming",
                        "Algorithms"]
        }
    ]
}

(1-1) with open(): writing JSON data to disk with json.dump()

with open("student_file.json", "w") opens a file named "student_file.json" in write mode ("w"), and json.dump(student_data, json_file) serializes the object student_data and writes it to the file object json_file.

import json

with open("student_file.json", "w") as json_file:
    json.dump(student_data, json_file)

This creates a new JSON file named 'student_file.json'.

(1-2) Creating JSON data in memory with json.dumps()

To build the JSON data in memory as a string and keep working with it in Python, use json.dumps().

import json

st_json = json.dumps(student_data)

print(st_json)

{"5.Courses": [{"Classes": ["Probability", "Generalized Linear Model", "Categorical Data Analysis"], "Major": "Statistics"}, {"Minor": "ComputerScience", "Classes": ["Data Structure", "Programming", "Algorithms"]}], "3.Age": 20, "2.LastName": "Hong", "4.University": "Yonsei University", "1.FirstName": "Gildong"}

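Note that if you leave off the 's' and call json.dump() instead of json.dumps(), a TypeError is raised, because json.dump() also requires a file object as its second argument.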

# use json.dumps() instead of json.dump()

st_json = json.dump(student_data)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-ac881d00bbcb> in <module>()
----> 1 st_json = json.dump(student_data)

TypeError: dump() missing 1 required positional argument: 'fp'

When serializing a Python object to JSON with json.dumps(), you can set the indentation with the 'indent = int' option to make the output easier for people to read. The example below uses indent=4, and the result is much more readable.

import json

st_json2 = json.dumps(student_data, indent=4)

print(st_json2)

{
    "3.Age": 20,
    "5.Courses": [
        {
            "Classes": [
                "Probability",
                "Generalized Linear Model",
                "Categorical Data Analysis"
            ],
            "Major": "Statistics"
        },
        {
            "Minor": "ComputerScience",
            "Classes": [
                "Data Structure",
                "Programming",
                "Algorithms"
            ]
        }
    ],
    "1.FirstName": "Gildong",
    "4.University": "Yonsei University",
    "2.LastName": "Hong"
}

Setting 'sort_keys=True' sorts the output by key when serializing.

import json

st_json3 = json.dumps(student_data, indent=4, sort_keys=True)

print(st_json3)

{
    "1.FirstName": "Gildong",
    "2.LastName": "Hong",
    "3.Age": 20,
    "4.University": "Yonsei University",
    "5.Courses": [
        {
            "Classes": [
                "Probability",
                "Generalized Linear Model",
                "Categorical Data Analysis"
            ],
            "Major": "Statistics"
        },
        {
            "Classes": [
                "Data Structure",
                "Programming",
                "Algorithms"
            ],
            "Minor": "ComputerScience"
        }
    ]
}

(2) Reading JSON data into a Python object (deserialization, decoding): json.load(), json.loads()

(2-1) Reading JSON data on disk into a Python object with json.load()

Next, let's deserialize the JSON file "student_file.json", which was created in (1) with open() and json.dump(), back into Python. The file is opened in read mode ("r") using with open("student_file.json", "r") and then decoded with json.load(st_json).

import json

with open("student_file.json", "r") as st_json:
    st_python = json.load(st_json)

st_python

{'1.FirstName': 'Gildong',
 '2.LastName': 'Hong',
 '3.Age': 20,
 '4.University': 'Yonsei University',
 '5.Courses': [{'Classes': ['Probability',
    'Generalized Linear Model',
    'Categorical Data Analysis'],
   'Major': 'Statistics'},
  {'Classes': ['Data Structure', 'Programming', 'Algorithms'],
   'Minor': 'ComputerScience'}]}

If you add an 's' and call json.loads() here, you get TypeError: the JSON object must be str, not 'TextIOWrapper', because json.loads() expects a string rather than a file object. (Use json.load(), not json.loads().)

# use json.load() instead of json.loads()

with open("student_json_file.json", "r") as st_json:
    st_python = json.loads(st_json)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-53-c39634419df6> in <module>()
      1 with open("student_json_file.json", "r") as st_json:
----> 2     st_python = json.loads(st_json)

C:\Users\admin\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    310     if not isinstance(s, str):
    311         raise TypeError('the JSON object must be str, not {!r}'.format(
--> 312                             s.__class__.__name__))
    313     if s.startswith(u'\ufeff'):
    314         raise JSONDecodeError("Unexpected UTF-8 BOM (decode using utf-8-sig)",

TypeError: the JSON object must be str, not 'TextIOWrapper'

(2-2) Reading JSON data in memory into a Python object with json.loads()

import json

st_python2 = json.loads(st_json3)

st_python2

{'1.FirstName': 'Gildong',
 '2.LastName': 'Hong',
 '3.Age': 20,
 '4.University': 'Yonsei University',
 '5.Courses': [{'Classes': ['Probability',
    'Generalized Linear Model',
    'Categorical Data Analysis'],
   'Major': 'Statistics'},
  {'Classes': ['Data Structure', 'Programming', 'Algorithms'],
   'Minor': 'ComputerScience'}]}

Note that if you drop the 's' and call json.load() instead of json.loads() here, you get AttributeError: 'str' object has no attribute 'read'.

# use json.loads() instead of json.load()

st_python2 = json.load(st_json3)

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)
<ipython-input-54-9de49903fef6> in <module>()
----> 1 st_python2 = json.load(st_json3)

C:\Users\admin\Anaconda3\lib\json\__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    263 
    264     """
--> 265     return loads(fp.read(),
    266         cls=cls, object_hook=object_hook,
    267         parse_float=parse_float, parse_int=parse_int,

AttributeError: 'str' object has no attribute 'read'
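The four functions pair up by what they work on: dumps() and loads() handle strings, while dump() and load() handle file objects. The sketch below shows both pairs side by side. It uses io.StringIO as an in-memory stand-in for a file opened with open(), and the data dictionary is a made-up example.

```python
import io
import json

data = {"name": "Gildong", "age": 20}

# String-based pair: dumps() returns a str, loads() parses a str
text = json.dumps(data)
from_text = json.loads(text)

# File-based pair: dump() writes to a file object, load() reads from one.
# io.StringIO stands in for a file opened with open(...) here.
buffer = io.StringIO()
json.dump(data, buffer)
buffer.seek(0)                 # rewind before reading
from_file = json.load(buffer)

print(from_text == data, from_file == data)
```

Mixing the pairs (a string passed to load(), or a file object passed to loads()) produces the AttributeError and TypeError shown above.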

I hope this was helpful.

The next post covers how to read JSON data from the web (API) into Python and turn it into a pandas DataFrame (https://rfriend.tistory.com/475).

For reading and writing XML files with Python, see https://rfriend.tistory.com/477.

For reading and writing YAML files with Python, see https://rfriend.tistory.com/540.


[Python] Sorting a Dictionary by Key or Value

Python Analysis and Programming/Python Data Preprocessing 2019. 8. 28. 23:43

Earlier posts covered how to create and use the dictionary data type (https://rfriend.tistory.com/333) and its built-in functions and methods (https://rfriend.tistory.com/334).

A dictionary is written in curly braces as {key: value} pairs. It is implemented as a hash table: keys are hashed, so looking up a value by its key is very fast, which makes the dictionary an efficient and convenient data type.

In this post, we will look at how to sort a dictionary

(1) by key in ascending order

(2) by key in descending order

(3) by value in ascending order

(4) by value in descending order

Let's make a simple example dictionary of {key: value} pairs (popularity of programming languages).

# make a dictionary

pgm_lang = {
    "java": 20,
    "javascript": 8,
    "c": 7,
    "r": 4,
    "python": 28
}

(1) Sort by key in ascending order: sorted()

Let's sort the pgm_lang dictionary alphabetically by key in ascending order. Sorting dict.keys() returns a sorted list of the keys only, while sorting dict.items() returns a list of (key, value) tuples sorted by key.

sorted(pgm_lang.keys())

['c', 'java', 'javascript', 'python', 'r']

sorted(pgm_lang.items())

[('c', 7), ('java', 20), ('javascript', 8), ('python', 28), ('r', 4)]

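Now let's print the (key, value) tuples sorted by key one per line using a for loop.

The first example is a plain for loop; the second prints the same result using a list comprehension. (The list comprehension is evaluated only for its print() side effect, which is why a list of None values is displayed at the end.)

for key, value in sorted(pgm_lang.items()):
    print(key, ":", value)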

c : 7
java : 20
javascript : 8
python : 28
r : 4

# list comprehension

[print(key, ":", value) for (key, value) in sorted(pgm_lang.items())]

c : 7
java : 20
javascript : 8
python : 28
r : 4

[None, None, None, None, None]

When sorting, you can also pass a lambda as a custom key function. The example below sorts in ascending order by the length of each key.

# sort a dictionary by custom key function (e.g. by the length of key strings)

pgm_lang_len = sorted(pgm_lang.items(), key=lambda item: len(item[0]))  # key: [0]

for key, value in pgm_lang_len:
    print(key, ":", value)

c : 7
r : 4
java : 20
python : 28
javascript : 8
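In the length-based sort above, 'c' and 'r' both have length 1, so their relative order comes from the dictionary's insertion order (sorted() is stable). A minimal sketch of making the tie-break explicit by returning a tuple from the key function follows; the names by_len_then_name and by_len_desc are introduced here only for illustration.

```python
pgm_lang = {"java": 20, "javascript": 8, "c": 7, "r": 4, "python": 28}

# Sort by key length first, then alphabetically among keys of equal length
by_len_then_name = sorted(pgm_lang.items(),
                          key=lambda item: (len(item[0]), item[0]))

# Longest key first; negating the length reverses only that criterion,
# so ties are still broken alphabetically in ascending order
by_len_desc = sorted(pgm_lang.items(),
                     key=lambda item: (-len(item[0]), item[0]))

print(by_len_then_name)
print(by_len_desc)
```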

(2) Sort by key in descending order: reverse=True

To sort in descending order, add the reverse=True option.

# sorting in reverse order

sorted(pgm_lang.keys(), reverse=True)

['r', 'python', 'javascript', 'java', 'c']

for (key, value) in sorted(pgm_lang.items(), reverse=True):
    print(key, ":", value)

r : 4
python : 28
javascript : 8
java : 20
c : 7

(3) Sort by value in ascending order

To sort by value, use the lambda approach introduced above and specify the value, item[1], as the sort criterion passed to the key argument. (The key is indexed as item[0] and the value as item[1].)

sorted(pgm_lang.items(), key = lambda item: item[1]) # value: [1]

[('r', 4), ('c', 7), ('javascript', 8), ('java', 20), ('python', 28)]
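As an alternative to the lambda, the standard library's operator.itemgetter does the same indexing, and max() or min() with a key function finds a single extreme without sorting everything. This is a sketch using the same pgm_lang dictionary; the variable names are illustrative.

```python
from operator import itemgetter

pgm_lang = {"java": 20, "javascript": 8, "c": 7, "r": 4, "python": 28}

# itemgetter(1) picks the value out of each (key, value) tuple
by_value = sorted(pgm_lang.items(), key=itemgetter(1))

# For just the largest or smallest value, sorting is not needed
most_popular = max(pgm_lang, key=pgm_lang.get)
least_popular = min(pgm_lang, key=pgm_lang.get)

print(by_value)
print(most_popular, least_popular)
```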

Let's print the result sorted by value in ascending order one (key : value) pair at a time using a for loop. (A list comprehension gives the same result.)

pgm_lang_val = sorted(pgm_lang.items(), key=lambda item: item[1])

for key, value in pgm_lang_val:
    print(key, ":", value)

r : 4
c : 7
javascript : 8
java : 20
python : 28

# list comprehension

[print(key, ":", value) for (key, value) in sorted(pgm_lang.items(), key = lambda item: item[1])]

r : 4
c : 7
javascript : 8
java : 20
python : 28

[None, None, None, None, None]

(4) Sort by value in descending order

As in (3), the sort is by value, so key=lambda item: item[1] specifies the value (item[1]) as the sort criterion; since the order is descending, add the reverse=True option.

# in reverse order

pgm_lang_val_reverse = sorted(pgm_lang.items(),
                              reverse=True,
                              key=lambda item: item[1])

for key, value in pgm_lang_val_reverse:
    print(key, ":", value)

python : 28
java : 20
javascript : 8
c : 7
r : 4

The example just above unpacks each pair into two separate names, key and value, in the for loop. As in the example just below, you can instead receive each (key, value) tuple as items and index it with items[0] for the key and items[1] for the value.

# the same result as above

for items in pgm_lang_val_reverse:
    print(items[0], ":", items[1])

python : 28
java : 20
javascript : 8
c : 7
r : 4
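sorted() always returns a list of tuples. If a dictionary is needed again, the sorted pairs can be passed to dict(), because dictionaries preserve insertion order in Python 3.7 and later. A minimal sketch, with sorted_dict and top3 as illustrative names:

```python
pgm_lang = {"java": 20, "javascript": 8, "c": 7, "r": 4, "python": 28}

# Rebuild a dict whose iteration order follows the sorted (key, value) pairs
sorted_dict = dict(sorted(pgm_lang.items(),
                          key=lambda item: item[1],
                          reverse=True))

# Slice the sorted pairs to keep only the three largest values
top3 = dict(list(sorted_dict.items())[:3])

print(sorted_dict)
print(top3)
```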

That wraps up the four ways of sorting a dictionary by key or value with sorted().

For sorting DataFrames, tuples, and lists, see https://rfriend.tistory.com/281.
For sorting NumPy arrays with np.sort(), see http://rfriend.tistory.com/357.

I hope this was helpful.


[Python pandas] Converting String Columns of a DataFrame to Numeric: pd.to_numeric(), DataFrame.astype()

Python Analysis and Programming/Python Data Preprocessing 2019. 8. 25. 16:54

This post introduces two ways to convert string columns in a pandas DataFrame or Series to numeric data types.

(1) Converting string columns to numeric with the pd.to_numeric() function

(2) Converting string columns to numeric with the astype() method

(1) Converting string columns to numeric with the pd.to_numeric() function

(1-1) Converting a single string column to numeric

import numpy as np

import pandas as pd

# make a DataFrame

df = pd.DataFrame({'col_str': ['1', '2', '3', '4', '5']})

df

	col_str
0	1
1	2
2	3
3	4
4	5

# check data types

df.dtypes

col_str    object
dtype: object

df['col_int'] = pd.to_numeric(df['col_str'])

df

	col_str	col_int
0	1	1
1	2	2
2	3	3
3	4	4
4	5	5

df.dtypes

col_str    object
col_int     int64
dtype: object

(1-2) Converting multiple string columns of a DataFrame to numeric with apply() and to_numeric()

# make a DataFrame with 3 string columns

df2 = pd.DataFrame({'col_str_1': ['1', '2', '3'],
                    'col_str_2': ['4', '5', '6'],
                    'col_str_3': ['7.0', '8.1', '9.2']})

df2

	col_str_1	col_str_2	col_str_3
0	1	4	7.0
1	2	5	8.1
2	3	6	9.2

df2.dtypes

col_str_1    object
col_str_2    object
col_str_3    object
dtype: object

# convert 'col_str_1' and 'col_str_2' to numeric

df2[['col_int_1', 'col_int_2']] = df2[['col_str_1', 'col_str_2']].apply(pd.to_numeric)

df2

	col_str_1	col_str_2	col_str_3	col_int_1	col_int_2
0	1	4	7.0	1	4
1	2	5	8.1	2	5
2	3	6	9.2	3	6

df2.dtypes

col_str_1    object
col_str_2    object
col_str_3    object
col_int_1     int64
col_int_2     int64
dtype: object

# convert all columns of a DataFrame to numeric using apply() and to_numeric together

df3 = df2.apply(pd.to_numeric)

df3.dtypes

col_str_1      int64
col_str_2      int64
col_str_3    float64
col_int_1      int64
col_int_2      int64
dtype: object

(2) Converting string columns to numeric with the astype() method

(2-1) Converting all string columns of a DataFrame to float at once

df4 = pd.DataFrame({'col_str_1': ['1', '2', '3'],
                    'col_str_2': ['4.1', '5.5', '6.0']})

df4.dtypes

col_str_1    object
col_str_2    object
dtype: object

df5 = df4.astype(float)

df5

	col_str_1	col_str_2
0	1.0	4.1
1	2.0	5.5
2	3.0	6.0

df5.dtypes

col_str_1    float64
col_str_2    float64
dtype: object

(2-2) Converting string columns to numeric with a separate int or float type per column

# np.float was removed in NumPy 1.24; the built-in float is equivalent here
df6 = df4.astype({'col_str_1': int,
                  'col_str_2': float})

df6

	col_str_1	col_str_2
0	1	4.1
1	2	5.5
2	3	6.0

df6.dtypes

col_str_1      int64
col_str_2    float64
dtype: object

ValueError when the DataFrame also contains a column of non-numeric strings

If some of the strings in the DataFrame consist of letters rather than digits, trying to convert everything to a numeric type at once with apply(pd.to_numeric) or DataFrame.astype(int) raises a ValueError. (This may be too obvious to need writing down.)

In that case, select only the string columns that contain numbers and convert them individually.

Below, a DataFrame df7 that includes the letters-only string column 'col_2' is created, and converting all of its columns to numeric raises a ValueError.

df7 = pd.DataFrame({'col_1': ['1', '2', '3'],
                    'col_2': ['aaa', 'bbb', 'ccc']})

df7

	col_1	col_2
0	1	aaa
1	2	bbb
2	3	ccc

df7.dtypes

col_1    object
col_2    object
dtype: object

* ValueError

# ValueError

df7 = df7.apply(pd.to_numeric)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "aaa"

During handling of the above exception, another exception occurred:

-- middle part omitted --

~/anaconda3/lib/python3.6/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    124             coerce_numeric = False if errors in ('ignore', 'raise') else True
    125             values = lib.maybe_convert_numeric(values, set(),
--> 126                                                coerce_numeric=coerce_numeric)
    127 
    128     except Exception:

pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: ('Unable to parse string "aaa" at position 0', 'occurred at index col_2')

* ValueError

# ValueError

df7 = df7.astype(int)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-124-f50dad302c83> in <module>()
----> 1 df7 = df7.astype(int)

-- middle part omitted --

~/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy)
    623     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
    624         # work around NumPy brokenness, #1987
--> 625         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    626 
    627     if dtype.name in ("datetime64", "timedelta64"):

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

pandas/_libs/src/util.pxd in util.set_value_at_unsafe()

ValueError: invalid literal for int() with base 10: 'aaa'

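Coercing unparseable strings to NaN instead of raising ValueError: df.apply(pd.to_numeric, errors = 'coerce')

Slightly differently from the example above, suppose a column really is meant to be converted from string to numeric, but a few of its values are, by mistake, strings of letters rather than strings of digits. Converting it raises a ValueError because those strings cannot be parsed as numbers. To force the values that contain letters to NaN and convert the remaining numeric strings to a numeric type, add the errors = 'coerce' option. (Because NaN is a floating-point value, the affected column becomes float64.)

df8 = pd.DataFrame({'col_1': ['1', '2', '3'],
                    'col_2': ['4', 'bbb', '6']})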

df8

	col_1	col_2
0	1	4
1	2	bbb
2	3	6

df8 = df8.apply(pd.to_numeric)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "bbb"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-130-9e8d711c10d5> in <module>()
----> 1 df8 = df8.apply(pd.to_numeric)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4260                         f, axis,
   4261                         reduce=reduce,
-> 4262                         ignore_failures=ignore_failures)
   4263             else:
   4264                 return self._apply_broadcast(f, axis)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4356             try:
   4357                 for i, v in enumerate(series_gen):
-> 4358                     results[i] = func(v)
   4359                     keys.append(v.name)
   4360             except Exception as e:

~/anaconda3/lib/python3.6/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    124             coerce_numeric = False if errors in ('ignore', 'raise') else True
    125             values = lib.maybe_convert_numeric(values, set(),
--> 126                                                coerce_numeric=coerce_numeric)
    127 
    128     except Exception:

pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: ('Unable to parse string "bbb" at position 1', 'occurred at index col_2')

df8 = df8.apply(pd.to_numeric, errors = 'coerce')

df8

	col_1	col_2
0	1	4.0
1	2	NaN
2	3	6.0
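Since errors='coerce' silently replaces bad values with NaN, it can help to check which original strings were lost. The following is a sketch of one way to do that, not part of the original post: it compares the converted frame against the original to list the strings that failed to parse. The names converted, coerced_mask, and bad_values are illustrative.

```python
import pandas as pd

df8 = pd.DataFrame({'col_1': ['1', '2', '3'],
                    'col_2': ['4', 'bbb', '6']})

converted = df8.apply(pd.to_numeric, errors='coerce')

# A cell was coerced if it is NaN after conversion but was not missing before
coerced_mask = converted.isna() & df8.notna()

# Collect, per column, the original strings that could not be parsed
bad_values = {col: df8.loc[coerced_mask[col], col].tolist()
              for col in df8.columns if coerced_mask[col].any()}

print(bad_values)
print(converted.dtypes)
```

Reviewing bad_values before continuing makes it easier to tell a genuine typo in the data from a column that should not have been converted at all.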

I hope this was helpful.


[Python pandas] Renaming DataFrame Columns: df.columns = [], df.rename(columns)

Python Analysis and Programming/Python Data Preprocessing 2019. 8. 14. 00:37

This post shows how to change the column names and the index labels of a pandas DataFrame.

(1) Renaming the columns of a pandas DataFrame

: df.columns = ['a', 'b']

: df.rename(columns = {'old_nm' : 'new_nm'}, inplace = True)

(2) Renaming the index of a pandas DataFrame

: df.index = ['a', 'b']

: df.rename(index = {'old_nm': 'new_nm'}, inplace = True)

(1) Renaming the columns of a pandas DataFrame

Let's make a simple pandas DataFrame to use as an example.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'id': ['a', 'b', 'c', 'd'],
   ...: 'col_1': [1, 2, 3, 4],
   ...: 'col_2': [1, 1, 2, 2]},
   ...: columns = ['id', 'col_1', 'col_2'])

In [3]: df
Out[3]:
id col_1 col_2
0 a 1 1
1 b 2 1
2 c 3 2
3 d 4 2

In [4]: df.columns

Out[4]: Index(['id', 'col_1', 'col_2'], dtype='object')

(1-1) Renaming columns with df.columns = ["new_1", "new_2"]

In [5]: df.columns = ["group", "val_1", "val_2"]

In [6]: df
Out[6]:
group val_1 val_2
0 a 1 1
1 b 2 1
2 c 3 2
3 d 4 2

When renaming columns by assigning to the df.columns attribute, the number of new names must exactly match the number of columns in the DataFrame. If the number of columns in the DataFrame and the number of names in df.columns = [xx, xx, ...] differ, a ValueError: Length mismatch is raised.

# need to match the number of columns
# ValueError: Length mismatch
In [7]: df.columns = ["group", "val_1"] # length mismatch error
...:
Traceback (most recent call last):

File "<ipython-input-7-5ab3ecd42fe8>", line 1, in <module>
df.columns = ["group", "val_1"]
... middle part omitted ...

File "C:\Users\admin\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 155, in set_axis
'values have elements'.format(old=old_len, new=new_len))

ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

(1-2) Renaming columns with df.rename(columns = {"old_1": "new_1", "old_2": "new_2"}, inplace=True)

In [8]: df.rename(columns = {"id": "group",
   ...: "col_1": "val_1",
   ...: "col_2": "val_2"}, inplace = True)
   ...:
In [9]: df

Out[9]:
group val_1 val_2
0 a 1 1
1 b 2 1
2 c 3 2
3 d 4 2

Unlike assigning to df.columns, the df.rename(columns = {'old': 'new'}) method does not need to match the number of columns in the DataFrame, so you can rename only selected columns. The example below renames the "group" column to "ID_2".

In [10]: df.rename(columns = {"group": "ID_2"}, inplace = True)

In [11]: df
Out[11]:
ID_2 val_1 val_2
0 a 1 1
1 b 2 1
2 c 3 2
3 d 4 2

Let's use a lambda function to create new column names by adding the prefix "X_" to the existing column names of a DataFrame.

In [14]: df = pd.DataFrame({'id': ['a', 'b', 'c', 'd'],
    ...: 'col_1': [1, 2, 3, 4],
    ...: 'col_2': [1, 1, 2, 2]},
    ...: columns = ['id', 'col_1', 'col_2'])

In [15]: df
Out[15]:
id col_1 col_2
0 a 1 1
1 b 2 1
2 c 3 2
3 d 4 2

In [16]: df.rename(columns = lambda x: "X_" + x, inplace = True)

In [17]: df
Out[17]:
X_id X_col_1 X_col_2
0 a 1 1
1 b 2 1
2 c 3 2
3 d 4 2
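The examples above all pass inplace=True. As a complementary sketch, the snippet below shows what rename() does without it, and that any callable (not only a lambda) can serve as the mapper. The small DataFrame and the names renamed, upper, and unchanged are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b'], 'col_1': [1, 2], 'col_2': [1, 1]})

# Without inplace=True, rename() returns a new DataFrame and leaves df unchanged
renamed = df.rename(columns={'id': 'group'})

# Any callable works as the mapper, e.g. the built-in str.upper
upper = df.rename(columns=str.upper)

# Keys that match no column are ignored by default (errors='ignore')
unchanged = df.rename(columns={'no_such_col': 'x'})

print(list(df.columns))
print(list(renamed.columns))
print(list(upper.columns))
```

Assigning the returned DataFrame to a new name keeps the original available, which can make step-by-step checks easier than modifying in place.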

(2) Renaming the index of a DataFrame

(2-1) Renaming the index with df.index = ['new_idx1', 'new_idx2']

Here the number of new index labels must match the number of rows in the DataFrame's index.

In [17]: df
Out[17]:
X_id X_col_1 X_col_2
0 a 1 1
1 b 2 1
2 c 3 2
3 d 4 2

In [18]: df.index
Out[18]: RangeIndex(start=0, stop=4, step=1)

In [19]: df.index = ['a', 'b', 'c', 'd']

In [20]: df
Out[20]:
X_id X_col_1 X_col_2
a a 1 1
b b 2 1
c c 3 2
d d 4 2

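(2-2) Renaming the index with df.rename(index = {'old_idx': 'new_idx'}, inplace = True)

The example maps the default integer labels 0 to 3 to 'a' to 'd'. Labels in the mapping that are not present in the index are ignored.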

In [21]: df.rename(index = {0: 'a',
    ...: 1: 'b',
    ...: 2: 'c',
    ...: 3: 'd'}, inplace = True)

In [22]: df
Out[22]:
X_id X_col_1 X_col_2
a a 1 1
b b 2 1
c c 3 2
d d 4 2

For changing the column order of a pandas DataFrame, see https://rfriend.tistory.com/680.

I hope this was helpful.


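[Python Exceptions] Exception Handling in Python with the try, except, else, and finally Clauses

Python Analysis and Programming/Python Programming 2019. 8. 8. 00:00

When programming in Python, you will sooner or later run into situations, intended or not, where exceptions must be dealt with. Depending on what is needed, you can raise an error, avoid it, or handle the exception.

This post introduces the four clauses used for handling exceptions in Python: try, except, else, and finally.

The try clause holds the code block to run normally.
The except clause holds the code to run when the code in the try clause raises an exception.
The else clause holds the code to run when the code in the try clause raises no exception.
The finally clause holds the code that always runs last, whether or not the try clause raised an exception.

As a simple example, let's define a function that takes two numbers as parameters and divides them, using the try, except, else, and finally clauses.

Each of the try, except, else, and finally clauses ends with a colon (:), and the code block on the following lines is indented.

For except, you can use a bare 'except:' or, as with 'except ZeroDivisionError as zero_div_err:' in the example below, name the error explicitly using a built-in exception class. Naming the exception is usually preferable, since a bare 'except:' also catches SystemExit and KeyboardInterrupt. (See the class hierarchy of Python's built-in exceptions at the bottom of this post.)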

# Python Exceptions: try, except, else, finally

def divide(x, y):
    try:
        result = x / y
    except ZeroDivisionError as zero_div_err:
        print("Except clause:", zero_div_err)
    else:
        print("Else clause: The result of division is", result)
    finally:
        print("Finally clause is executed.")

(1) When the try clause runs normally: try --> else --> finally

Let's call the function divide(x, y), which takes two numbers x and y and divides them, with x=1 and y=2 and look at the result when the code runs normally. The else clause runs, and then the finally clause runs last.

In [1]: def divide(x, y):
   ...:     try:
   ...:         result = x / y
   ...:     except ZeroDivisionError as zero_div_err:
   ...:         print("Except clause:", zero_div_err)
   ...:     else:
   ...:         print("Else clause: The result of division is", result)
   ...:     finally:
   ...:         print("Finally clause is executed.")
   ...:

In [2]: divide(1, 2)
Else clause: The result of division is 0.5
Finally clause is executed.

(2) When an exception occurs in the try clause: try --> except --> finally

Dividing 1 by 0 runs the 'except ZeroDivisionError as zero_div_err:' clause, and then the finally clause runs last.

In [3]: divide(1, 0)
Except clause: division by zero
Finally clause is executed.

(3) When the try clause raises an unhandled error: try --> finally --> error raised

Passing strings to the divide(x, y) function, whose parameters expect numbers, raises a TypeError. Because no except clause matches TypeError, the finally clause runs first and then the TypeError propagates to the caller.

In [4]: divide("1", "2")
Finally clause is executed.
Traceback (most recent call last):

  File "<ipython-input-4-bbf78a5b43b9>", line 1, in <module>
    divide("1", "2")

  File "<ipython-input-1-78cbb56f9746>", line 3, in divide
    result = x / y

TypeError: unsupported operand type(s) for /: 'str' and 'str'

(4) Printing the exception name and error message when an error occurs in a try-except block

import numpy as np

arr = np.array([[1., 2., 3.], [4.0, 5., 6.]])
print(arr)

[Out]
[[1. 2. 3.]
 [4. 5. 6.]]

# This will not work and raises an error
try:
    arr.reshape(5, -1)
except Exception as e:
    print(f"{type(e).__name__}: {e}")

[Out]
ValueError: cannot reshape array of size 6 into shape (5,newaxis)
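The same pattern works with any exception, and the standard library's traceback module can capture the full traceback text when the one-line summary is not enough. The sketch below uses int("abc") instead of NumPy so that it runs with the standard library alone; the helper name describe is hypothetical.

```python
import traceback


def describe(exc):
    """Hypothetical helper: format an exception as 'ClassName: message'."""
    return f"{type(exc).__name__}: {exc}"


try:
    int("abc")  # raises ValueError
except Exception as e:
    summary = describe(e)
    # format_exc() returns the traceback of the exception being handled as a string.
    details = traceback.format_exc()

print(summary)  # ValueError: invalid literal for int() with base 10: 'abc'
print(details)
```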

[ Reference: The Class Hierarchy of Python Built-in Exceptions ]

(* source: https://docs.python.org/3/library/exceptions.html)

BaseException
 +-- SystemExit
 +-- KeyboardInterrupt
 +-- GeneratorExit
 +-- Exception
      +-- StopIteration
      +-- StopAsyncIteration
      +-- ArithmeticError
      |    +-- FloatingPointError
      |    +-- OverflowError
      |    +-- ZeroDivisionError
      +-- AssertionError
      +-- AttributeError
      +-- BufferError
      +-- EOFError
      +-- ImportError
      |    +-- ModuleNotFoundError
      +-- LookupError
      |    +-- IndexError
      |    +-- KeyError
      +-- MemoryError
      +-- NameError
      |    +-- UnboundLocalError
      +-- OSError
      |    +-- BlockingIOError
      |    +-- ChildProcessError
      |    +-- ConnectionError
      |    |    +-- BrokenPipeError
      |    |    +-- ConnectionAbortedError
      |    |    +-- ConnectionRefusedError
      |    |    +-- ConnectionResetError
      |    +-- FileExistsError
      |    +-- FileNotFoundError
      |    +-- InterruptedError
      |    +-- IsADirectoryError
      |    +-- NotADirectoryError
      |    +-- PermissionError
      |    +-- ProcessLookupError
      |    +-- TimeoutError
      +-- ReferenceError
      +-- RuntimeError
      |    +-- NotImplementedError
      |    +-- RecursionError
      +-- SyntaxError
      |    +-- IndentationError
      |         +-- TabError
      +-- SystemError
      +-- TypeError
      +-- ValueError
      |    +-- UnicodeError
      |         +-- UnicodeDecodeError
      |         +-- UnicodeEncodeError
      |         +-- UnicodeTranslateError
      +-- Warning
           +-- DeprecationWarning
           +-- PendingDeprecationWarning
           +-- RuntimeWarning
           +-- SyntaxWarning
           +-- UserWarning
           +-- FutureWarning
           +-- ImportWarning
           +-- UnicodeWarning
           +-- BytesWarning
           +-- ResourceWarning
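The hierarchy matters in practice because an except clause that names a parent class also catches every subclass. The sketch below checks a few of the relationships from the tree with issubclass and catches a ZeroDivisionError through its parent ArithmeticError; the function name divide_or_none is hypothetical.

```python
# Relationships taken from the tree above.
print(issubclass(ZeroDivisionError, ArithmeticError))  # True
print(issubclass(KeyboardInterrupt, Exception))        # False: it derives from BaseException directly
print([cls.__name__ for cls in ZeroDivisionError.__mro__])
# ['ZeroDivisionError', 'ArithmeticError', 'Exception', 'BaseException', 'object']


def divide_or_none(x, y):
    """Hypothetical helper: return x / y, or None on any arithmetic error."""
    try:
        return x / y
    except ArithmeticError as err:
        # Catches ZeroDivisionError, OverflowError, and FloatingPointError.
        print(f"Caught {type(err).__name__} via ArithmeticError")
        return None


print(divide_or_none(1, 0))  # prints the "Caught ..." line, then None
```

Because KeyboardInterrupt and SystemExit sit outside Exception, 'except Exception:' leaves them alone, whereas a bare 'except:' catches them too.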

I hope this was helpful.

If this post helped you, please press the 'Like' button below. :-)


Attribution, NonCommercial, NoDerivatives


Posted by Rfriend


R, Python 분석과 프로그래밍의 친구 (by R Friend)

151 posts tagged '파이썬' (Python)
