[Python] Parsing a text file into a DataFrame and creating a new column by lagging within groups

Python Analysis and Programming/Python Data Preprocessing 2019. 12. 26. 19:56

In this post, we will

(1) open a text file, read it line by line, and parse it using string methods

--> build a pandas DataFrame from it, and

(2) create a new column by shifting values down one row (lag) within each ID group.

Suppose we have a text file named 'color_range.txt'.

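The first row, AAA, means that 0 to 100 is region a, 100 to 200 is region b, and so on. Here a (red) and b (blue) denote colors, so for AAA the range from 0 (inclusive) to 100 (exclusive) is red, 100 (inclusive) to 200 (exclusive) is blue, 200 (inclusive) to 300 (exclusive) is red, and so on.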

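The file stretches the data out horizontally along each row. We want to turn it into a pandas DataFrame that, for each ID ('AAA', 'BBB'), separates the start and end numbers of each color (a: red, b: blue) into their own columns so they are easy to read.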
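The file's contents are not reproduced in the text, so here is a minimal sketch that writes a stand-in 'color_range.txt' for following along. The exact layout (a quoted three-character ID followed by space-separated color/end-number pairs) is an assumption inferred from the parsing steps and outputs in this post, and the name sample_lines is only illustrative; the original file may differ in detail.

```python
from pathlib import Path

# ASSUMED layout, inferred from this post's parsing logic and outputs:
# a quoted 3-character ID, then repeated "<color code> <bin end>" pairs.
sample_lines = [
    '"AAA" a 100 b 200 a 300 b 400',
    '"BBB" a 250 b 350 a 450 b 550 a 650 b 750 a 800 b 910',
]

# write the stand-in file into the current working directory
Path('color_range.txt').write_text('\n'.join(sample_lines) + '\n')
```

With this file in the working directory, the parsing code in step (1) can be run as written.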

(1) Open a text file, read it line by line, and parse it using string methods

--> build a pandas DataFrame

import os

import pandas as pd

# set file path
cwd = os.getcwd()
file_path = os.path.join(cwd, 'color_range.txt')

# read 'color_range.txt' and parse each line into ID and values
frames = []  # per-line DataFrames, combined after the loop

# open the file; the 'with' block closes it automatically
with open(file_path) as f:
    # parse the text line by line
    for line in f:
        line = line.strip()              # remove surrounding white space
        if not line:
            continue                     # skip blank lines
        line = line.replace('"', '')     # delete '"'

        # get ID (first 3 characters) and VALUE (the rest) from a line
        line_id = line[:3]               # not named 'id', to avoid shadowing the built-in id()
        val = line[4:]

        # turn the space before each color code into a comma separator
        val = val.replace(' a', ',a')
        val = val.replace(' b', ',b')

        # split a line using separator ','
        val_split = val.split(sep=',')

        # collect 'ID', 'COLOR', 'BIN_END' values for this line
        id_list = []
        color_list = []
        bin_list = []
        for item in val_split:
            id_list.append(line_id)
            color_list.append(item[:1])
            bin_list.append(item[2:])

        # make a temp DataFrame holding ID, COLOR, BIN_END values for this line
        # note: the values are lists, so this works even when a line has a single
        # entry; passing bare scalars instead would raise a ValueError
        df_tmp = pd.DataFrame({'id': id_list,
                               'color_cd': color_list,
                               'bin_end': bin_list})
        frames.append(df_tmp)

# combine the per-line DataFrames into one
df = pd.concat(frames, axis=0, ignore_index=True)

# let's check df DataFrame
df

[Out]:

	id	color_cd	bin_end
0	AAA	a	100
1	AAA	b	200
2	AAA	a	300
3	AAA	b	400
4	BBB	a	250
5	BBB	b	350
6	BBB	a	450
7	BBB	b	550
8	BBB	a	650
9	BBB	b	750
10	BBB	a	800
11	BBB	b	910
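As a cross-check on the parsing step, the same table can be built by splitting each line on whitespace and pairing up the tokens. This is only a sketch of an alternative, not the method used in this post; raw_lines is an assumed stand-in for the file's contents, and here bin_end is converted to int rather than kept as text.

```python
import pandas as pd

# ASSUMED stand-in for the lines of 'color_range.txt'
raw_lines = [
    '"AAA" a 100 b 200 a 300 b 400',
    '"BBB" a 250 b 350 a 450 b 550 a 650 b 750 a 800 b 910',
]

records = []
for line in raw_lines:
    # drop the quotes, then split on whitespace: first token is the ID
    tokens = line.replace('"', '').split()
    line_id, pairs = tokens[0], tokens[1:]
    # remaining tokens alternate: color code, bin end, color code, bin end, ...
    for color_cd, bin_end in zip(pairs[0::2], pairs[1::2]):
        records.append({'id': line_id,
                        'color_cd': color_cd,
                        'bin_end': int(bin_end)})

df_alt = pd.DataFrame(records)
```

Building a list of dicts and creating the DataFrame once avoids repeated concatenation inside the loop.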

(2) Create a new column by shifting values down one row (lag) within each ID group

Grouping by 'id', we shift the 'bin_end' column down one row (shift(1)) and fill the missing value in each group's first row with 0 (fillna(0)).

# lag 1 group by 'id' and fill missing value with '0'

df['bin_start'] = df.groupby('id')['bin_end'].shift(1).fillna(0)

[Out]:

	id	color_cd	bin_end	bin_start
0	AAA	a	100	0
1	AAA	b	200	100
2	AAA	a	300	200
3	AAA	b	400	300
4	BBB	a	250	0
5	BBB	b	350	250
6	BBB	a	450	350
7	BBB	b	550	450
8	BBB	a	650	550
9	BBB	b	750	650
10	BBB	a	800	750
11	BBB	b	910	800
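To see the grouped lag in isolation, here is a small sketch on a made-up numeric table (the demo DataFrame is hypothetical, not the post's data). When the column is numeric, shift(1) introduces NaN and turns the values into floats, so the sketch casts back to int after filling.

```python
import pandas as pd

# hypothetical toy data with a numeric 'bin_end' column
demo = pd.DataFrame({
    'id': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
    'bin_end': [100, 200, 300, 250, 350],
})

# lag by one row within each 'id' group; the first row of each group
# has no previous value, so it becomes NaN and is filled with 0
demo['bin_start'] = (
    demo.groupby('id')['bin_end']
        .shift(1)
        .fillna(0)
        .astype(int)
)
```

The lag restarts at every group boundary, which is why the first 'BBB' row gets 0 rather than the last 'AAA' value.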

Next, let's map the color codes ('color_cd') to color names: 'a' to red and 'b' to blue.

# mapping color using color_cd
color_map = {'a': 'red',
             'b': 'blue'}
df['color'] = df['color_cd'].map(lambda x: color_map.get(x, x))

[Out]:

	id	color_cd	bin_end	bin_start	color
0	AAA	a	100	0	red
1	AAA	b	200	100	blue
2	AAA	a	300	200	red
3	AAA	b	400	300	blue
4	BBB	a	250	0	red
5	BBB	b	350	250	blue
6	BBB	a	450	350	red
7	BBB	b	550	450	blue
8	BBB	a	650	550	red
9	BBB	b	750	650	blue
10	BBB	a	800	750	red
11	BBB	b	910	800	blue
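The lambda with color_map.get(x, x) matters only when a code is missing from the dictionary. The sketch below contrasts it with passing the dictionary directly to map; the extra code 'c' is a made-up value added purely to show the difference.

```python
import pandas as pd

color_map = {'a': 'red', 'b': 'blue'}

# 'c' is a hypothetical code that is not in color_map
codes = pd.Series(['a', 'b', 'c'])

# get(x, x) falls back to the original code when there is no mapping
with_fallback = codes.map(lambda x: color_map.get(x, x))

# passing the dict directly leaves unmapped codes as missing values
strict = codes.map(color_map)
```

Which behavior is preferable depends on whether an unknown code should pass through unchanged or be flagged as missing.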

To make the table easier to read, let's reorder the columns as 'id', 'color_cd', 'color', 'bin_start', 'bin_end'.

# change the sequence of columns

df = df[['id', 'color_cd', 'color', 'bin_start', 'bin_end']]

[Out]:

	id	color_cd	color	bin_start	bin_end
0	AAA	a	red	0	100
1	AAA	b	blue	100	200
2	AAA	a	red	200	300
3	AAA	b	blue	300	400
4	BBB	a	red	0	250
5	BBB	b	blue	250	350
6	BBB	a	red	350	450
7	BBB	b	blue	450	550
8	BBB	a	red	550	650
9	BBB	b	blue	650	750
10	BBB	a	red	750	800
11	BBB	b	blue	800	910

To make it clear that bin_start is included and bin_end is not included,

==> we join the inclusive sign ('['), 'bin_start', 'bin_end', and the exclusive sign (')')

to create a new column called 'bin_range'.

# make a 'bin_range' column with include '[' and exclude ')' signs
df['bin_range'] = df['bin_start'].apply(lambda x: '[' + str(x) + ',') + \
    df['bin_end'].apply(lambda x: str(x) + ')')

[Out]:

	id	color_cd	color	bin_start	bin_end	bin_range
0	AAA	a	red	0	100	[0,100)
1	AAA	b	blue	100	200	[100,200)
2	AAA	a	red	200	300	[200,300)
3	AAA	b	blue	300	400	[300,400)
4	BBB	a	red	0	250	[0,250)
5	BBB	b	blue	250	350	[250,350)
6	BBB	a	red	350	450	[350,450)
7	BBB	b	blue	450	550	[450,550)
8	BBB	a	red	550	650	[550,650)
9	BBB	b	blue	650	750	[650,750)
10	BBB	a	red	750	800	[750,800)
11	BBB	b	blue	800	910	[800,910)
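The same label can also be assembled without lambda functions by converting both columns to strings and concatenating them. This is a sketch on a hypothetical two-row table, shown as an alternative rather than as the approach used in this post.

```python
import pandas as pd

# hypothetical toy data
demo = pd.DataFrame({'bin_start': [0, 100], 'bin_end': [100, 200]})

# convert each column to text, then concatenate element-wise
demo['bin_range'] = (
    '[' + demo['bin_start'].astype(str)
    + ',' + demo['bin_end'].astype(str) + ')'
)
```

Because astype(str) handles the conversion up front, this form works whether the columns hold numbers or text.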

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend
