[Python] Reading XML Data from the Web with Python and Converting It to a DataFrame

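In the previous post, I introduced how to read a JSON file from the web and convert it to a pandas DataFrame.

In this post, I will show how to use Python to read XML (Extensible Markup Language) from the web, parse it, convert it to a pandas DataFrame, and then analyze and visualize it. Along with JSON, XML is one of the data formats most widely used in web applications.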

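XML (Extensible Markup Language) is a markup language that defines a set of rules for encoding documents in a form that both humans and machines can read. The design goals of XML emphasize simplicity, generality, and usability across the Internet. XML is a textual data format with strong support, via Unicode, for different human languages. Although the design of XML focuses on documents, it is widely used to represent arbitrary data structures, such as those used in web services.
- from Wikipedia (https://en.wikipedia.org/wiki/XML) -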

XML looks like the example below. It looks somewhat like HTML and JSON, doesn't it?

[ Example of XML format data ]

<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>CBS Records</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1988</YEAR>
  </CD>
  <CD>
    <TITLE>Unchain my heart</TITLE>
    <ARTIST>Joe Cocker</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>EMI</COMPANY>
    <PRICE>8.20</PRICE>
    <YEAR>1987</YEAR>
  </CD>
</CATALOG>

- source: https://www.w3schools.com/xml/cd_catalog.xml -
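
To make the tree structure concrete, here is a minimal sketch that parses a shortened copy of the catalog above from an inline string, so no network access is needed. The names `sample_xml` and `root` are illustrative only and are not used elsewhere in this post.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the CD catalog shown above (two <CD> records)
sample_xml = """
<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <COUNTRY>UK</COUNTRY>
  </CD>
</CATALOG>
"""

# fromstring() parses the text and returns the root element
root = ET.fromstring(sample_xml)

print(root.tag)   # tag name of the root element
print(len(root))  # number of direct children (<CD> elements)

# Iterating over an element yields its direct children
for cd in root:
    # Each child of <CD> is a leaf element that holds text
    print(cd.tag, [(field.tag, field.text) for field in cd])
```

The root element plays the role of the tree's root, each `<CD>` is a branch, and the fields such as `<TITLE>` are the leaves that carry the actual text.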

[ Code example: reading XML data from the web with Python and converting it to a pandas DataFrame ]

(1) Import Libraries

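First, let's import the xml.etree.ElementTree module, which is needed to parse XML, and urlopen from the urllib package, which sends a request to a website so that we can read the XML file. You can think of the name xml.etree.ElementTree as reflecting the idea that XML data is like a tree: parsing proceeds step by step from the root through the trunk and branches down to the leaves.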

import pandas as pd
import xml.etree.ElementTree as ET
import sys

# urlopen lives in a different place in Python 3 and Python 2
if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    from urllib import urlopen

In Python 3.x, urlopen must be imported from the request submodule of the urllib package with 'from urllib.request import urlopen'. If you instead use 'from urllib import urlopen' in Python 3.x, as below, you get 'ImportError: cannot import name 'urlopen''.

# If you are using Python 3.x version, then ImportError will be raised as below

from urllib import urlopen

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-dbf1dbb53f94> in <module>()
----> 1 from urllib import urlopen

ImportError: cannot import name 'urlopen'
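
As an alternative to checking `sys.version_info`, the same fallback can be sketched with `try`/`except ImportError`. This is a common idiom rather than something the original code uses, and on a Python 3-only project the first import alone is enough.

```python
try:
    # Python 3: urlopen lives in the urllib.request submodule
    from urllib.request import urlopen
except ImportError:
    # Python 2 fallback: urlopen was exposed directly by urllib
    from urllib import urlopen

print(callable(urlopen))
```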

(2) Open URL and Read XML data from Website URL

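Now let's read the CD catalog in XML format from "https://www.w3schools.com/xml/cd_catalog.xml". read() returns the raw document as a byte string (bytes in Python 3), and ET.fromstring() parses it and returns the root element of the tree.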

url = "https://www.w3schools.com/xml/cd_catalog.xml"

response = urlopen(url).read()

xtree = ET.fromstring(response)

xtree

<Element 'CATALOG' at 0x00000219E77DCCC8>
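
The fetch-and-parse step can also be wrapped in a small helper that closes the connection and sets a timeout. This is only a sketch: the function name `fetch_xml_root` and the 10-second default are assumptions made for illustration, and it targets Python 3 only.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch_xml_root(url, timeout=10):
    """Download an XML document and return its root Element.

    The name and the default timeout are illustrative choices.
    """
    # The with-block closes the response even if read() raises
    with urlopen(url, timeout=timeout) as response:
        data = response.read()  # raw bytes of the document
    # fromstring() accepts bytes and returns the root element
    return ET.fromstring(data)
```

Calling `fetch_xml_root(url)` with the `url` defined above should return the same `CATALOG` root element as `xtree`. Network failures surface as `urllib.error.URLError`, which the caller can catch if needed.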

(3) Parsing XML data into text by iterating through each node of the tree

Next, we loop over the nodes of the tree with a for loop, find the fields we need in each node, extract their text, and append them one by one to a list as dictionaries of key-value pairs.

rows = []

# iterate through each node of the tree
for node in xtree:
    n_title = node.find("TITLE").text
    n_artist = node.find("ARTIST").text
    n_country = node.find("COUNTRY").text
    n_company = node.find("COMPANY").text
    n_price = node.find("PRICE").text
    n_year = node.find("YEAR").text

    rows.append({"title": n_title,
                 "artist": n_artist,
                 "country": n_country,
                 "company": n_company,
                 "price": n_price,
                 "year": n_year})

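One caveat with `node.find("TAG").text` is that `find()` returns `None` when a tag is missing, so the `.text` access raises `AttributeError`. The sketch below shows two ways around that on a small inline sample: building each row from whatever child tags exist, and using `findtext()` with a default. The helper name `cd_to_row` and the sample data (the second record deliberately omits `<YEAR>`) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Inline sample; the second <CD> has no <YEAR> on purpose
sample_xml = """
<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <PRICE>9.90</PRICE>
  </CD>
</CATALOG>
"""


def cd_to_row(node):
    """Map every child tag of one <CD> element to a lower-case dict key."""
    return {child.tag.lower(): (child.text or "").strip() for child in node}


root = ET.fromstring(sample_xml)

# One dictionary per <CD>, with keys taken from the tags that are present
rows = [cd_to_row(cd) for cd in root.iter("CD")]
print(rows)

# findtext() returns the default instead of raising when the tag is absent
missing_year = root[1].findtext("YEAR", default=None)
print(missing_year)
```

Rows with missing keys can still be passed to `pd.DataFrame`; pandas fills the absent cells with `NaN`.
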
(4) Convert XML text data into pandas DataFrame

# convert XML data to pandas DataFrame

columns = ["title", "artist", "country", "company", "price", "year"]

catalog_cd_df = pd.DataFrame(rows, columns = columns)

catalog_cd_df.head(10)

	title	artist	country	company	price	year
0	Empire Burlesque	Bob Dylan	USA	Columbia	10.90	1985
1	Hide your heart	Bonnie Tyler	UK	CBS Records	9.90	1988
2	Greatest Hits	Dolly Parton	USA	RCA	9.90	1982
3	Still got the blues	Gary Moore	UK	Virgin records	10.20	1990
4	Eros	Eros Ramazzotti	EU	BMG	9.90	1997
5	One night only	Bee Gees	UK	Polydor	10.90	1998
6	Sylvias Mother	Dr.Hook	UK	CBS	8.10	1973
7	Maggie May	Rod Stewart	UK	Pickwick	8.50	1990
8	Romanza	Andrea Bocelli	EU	Polydor	10.80	1996
9	When a man loves a woman	Percy Sledge	USA	Atlantic	8.70	1987
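
For flat XML like this catalog, pandas 1.3 and later also provide `pandas.read_xml`, which can build the DataFrame without a manual loop. The sketch below is an alternative to the steps above, not the method this post uses. It parses an inline sample, and `parser="etree"` is chosen here so that the optional lxml package is not required.

```python
from io import StringIO

import pandas as pd

# Two records copied from the catalog example at the top of this post
sample_xml = """<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>CBS Records</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1988</YEAR>
  </CD>
</CATALOG>"""

# parser="etree" uses the standard library parser instead of lxml;
# with the default xpath, each child of the root becomes one row
catalog_df = pd.read_xml(StringIO(sample_xml), parser="etree")

print(catalog_df)
print(catalog_df.dtypes)
```

Note that the column names keep the original tag case (TITLE, ARTIST, ...), and that `read_xml` tries to infer numeric types, so PRICE and YEAR should come back as numbers rather than strings.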

(5) Change data type from string object to float64, int32 for numeric data

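Checking the data type of each column with df.dtypes below shows that every column is a string object. Let's use astype() to convert price to float64 and year to an integer type (int32 in the output below; the integer width depends on the platform and NumPy version).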

catalog_cd_df.dtypes

title      object
artist     object
country    object
company    object
price      object
year       object
dtype: object

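# The np.float alias was removed in NumPy 1.24; the builtin float maps to float64.
# The builtin int maps to int32 or int64 depending on the platform and NumPy version.
catalog_cd_df = catalog_cd_df.astype({'price': float,
                                      'year': int})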

catalog_cd_df.dtypes

title       object
artist      object
country     object
company     object
price      float64
year         int32
dtype: object
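
`astype()` raises an error as soon as one value cannot be converted. If the XML might contain empty or malformed numbers, `pd.to_numeric` with `errors="coerce"` is a more forgiving option, since it turns unparseable values into `NaN`. The small DataFrame below, including the "n/a" price, is made up for illustration.

```python
import pandas as pd

# Hypothetical string data; "n/a" stands in for a malformed price
df = pd.DataFrame({
    "price": ["10.90", "9.90", "n/a"],
    "year": ["1985", "1988", "1987"],
})

# Unparseable strings become NaN instead of raising ValueError
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# All years are valid here, so an explicit 64-bit integer dtype is safe
df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("int64")

print(df.dtypes)
print(df)
```

Requesting "int64" explicitly also avoids the platform-dependent integer width that the builtin `int` can produce.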

(6) Calculate the mean price by country and draw a bar plot

country_mean = catalog_cd_df.groupby('country').price.mean()

country_mean

country
EU        9.320000
Norway    7.900000
UK        8.984615
USA       9.385714
Name: price, dtype: float64

country_mean_df = pd.DataFrame(country_mean).reset_index()

import matplotlib.pyplot as plt  # needed for plt.show()
import seaborn as sns

sns.barplot(x='country', y='price', data=country_mean_df)

plt.show()
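
The groupby step and the conversion back to a DataFrame can also be done in a single call by passing `as_index=False`. The sketch below uses only the three CDs from the XML example at the top of this post, so its numbers differ from the full-catalog means shown above.

```python
import pandas as pd

# Country and price of the three CDs in the XML example
df = pd.DataFrame({
    "country": ["USA", "UK", "USA"],
    "price": [10.90, 9.90, 8.20],
})

# as_index=False keeps 'country' as a regular column,
# so no separate reset_index() call is needed
country_mean_df = df.groupby("country", as_index=False)["price"].mean()

print(country_mean_df)
```

The result has the same `country` and `price` columns that `sns.barplot` expects.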

That concludes this introduction to reading XML data from the web with Python, parsing it, and converting it to a pandas DataFrame.

For reading and writing JSON files with Python, see https://rfriend.tistory.com/474.

For reading and writing YAML files with Python, see https://rfriend.tistory.com/540.

I hope this was helpful.
