[Python BeautifulSoup] How to crawl and scrape web pages using BeautifulSoup

In this post, I will show how to use Python's urllib and BeautifulSoup modules to parse the contents of a web page and crawl and scrape only the data you need.

The urllib module is the Python library for working with web page URLs. For example, urllib.request is used to open and read URLs, and urllib.parse is used to parse URLs.
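As a small illustration of the urllib.parse side, here is a minimal sketch that builds and then splits a search-page URL. The helper name search_page_url is hypothetical (it is not part of urllib), and the path pattern is simply the one used in the example later in this post. Nothing is downloaded here; urllib.request.urlopen is what would actually open the URL.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://oilprice.com/"


def search_page_url(keyword: str, page: int) -> str:
    """Hypothetical helper: build the URL of one search-result page."""
    # Path pattern copied from the example code in this post.
    return urljoin(BASE_URL, f"search/tab/articles/{keyword}/Page-{page}.html")


url = search_page_url("lng", 1)

# urlparse splits a URL into its components (scheme, host, path, ...).
parts = urlparse(url)

print(url)
print(parts.scheme, parts.netloc, parts.path)
```

Building the URL in one small function keeps the page number and keyword out of the string concatenation inside the scraping loop.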

The BeautifulSoup module is a Python library for pulling data out of HTML and XML files. It works with the parser of your choice and provides natural ways to navigate, search, and modify the parse tree.
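To make "navigate, search, and modify" concrete, here is a minimal sketch on a tiny HTML string. The HTML is made up purely for illustration and has nothing to do with the real site used later in this post.

```python
from bs4 import BeautifulSoup

# A tiny made-up HTML document, used only for illustration.
html = """
<html><body>
  <h1 id="top">Energy news</h1>
  <ul>
    <li class="item"><a href="/a1">First article</a></li>
    <li class="item"><a href="/a2">Second article</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate: attribute access returns the first tag with that name.
heading = soup.h1.text

# Search: find_all returns every matching tag in the tree.
links = [a["href"] for a in soup.find_all("a")]

# Modify: replace the text of a tag in place.
soup.h1.string = "LNG news"

print(heading)
print(links)
print(soup.h1.text)
```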

python urllib, BeautifulSoup module for web scraping

In this example, we will

(1) use the urlopen method of urllib.request to open and read all 20 result pages returned when searching for the keyword 'lng' on https://oilprice.com/,

(2) use the BeautifulSoup module to parse and collect the title, publication date (timestamp), and description of each of the 20 individual articles listed on each result page, and

(3) gather these data into a pandas DataFrame. (20 pages in total * 20 articles per page = 400 articles scraped in total)
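The shape of the final result in steps (1) to (3) can be sketched without touching the network. The snippet below is only a sketch: the row values are placeholders standing in for scraped text, and the constants N_PAGES and ARTICLES_PER_PAGE are names I made up for the two counts mentioned above.

```python
import pandas as pd

N_PAGES = 20            # number of search-result pages
ARTICLES_PER_PAGE = 20  # number of articles listed on each page

rows = []
for page in range(1, N_PAGES + 1):
    for k in range(1, ARTICLES_PER_PAGE + 1):
        # Placeholder values standing in for the scraped text.
        rows.append({
            "title": f"title p{page}-{k}",
            "timestamp": f"timestamp p{page}-{k}",
            "descrip": f"description p{page}-{k}",
        })

# Build the DataFrame once from the collected rows.
df = pd.DataFrame(rows, columns=["title", "timestamp", "descrip"])

print(df.shape)  # (400, 3)
```

Collecting plain dictionaries in a list and constructing the DataFrame once at the end is usually cheaper than growing a DataFrame row by row.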

webpage crawling, scraping using python urllib, BeautifulSoup, pandas

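The example code below was written by 파송송. Each search page lists 20 articles, but only the first one at the top was being scraped, and they asked how to fix that. This is the code after the problem was fixed.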
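The failing version of the code is not shown in this post, so the following is only a guess at the kind of mistake that produces this symptom: find() returns just the first matching tag, while find_all() returns every match. The HTML below is made up and only imitates a results container with several articles; the class name "article" is an assumption, not the real site's markup.

```python
from bs4 import BeautifulSoup

# Made-up markup that only imitates a results container with several articles.
html = """
<div id="search-results-articles">
  <div class="article"><a href="/1">Title 1</a><div class="dateadded">Date 1</div><p>Desc 1</p></div>
  <div class="article"><a href="/2">Title 2</a><div class="dateadded">Date 2</div><p>Desc 2</p></div>
  <div class="article"><a href="/3">Title 3</a><div class="dateadded">Date 3</div><p>Desc 3</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"id": "search-results-articles"})

# find() stops at the first match, so only one article is seen.
first_only = container.find("a").text

# find_all() returns every match inside the container.
all_titles = [a.text for a in container.find_all("a")]
all_dates = [d.text for d in container.find_all("div", {"class": "dateadded"})]
all_descs = [p.text for p in container.find_all("p")]

# Pair up the three lists; this assumes they line up one-to-one.
records = list(zip(all_titles, all_dates, all_descs))

print(first_only)
print(len(records))
```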

##-- How to Scrape Data on the Web with BeautifulSoup and urllib

from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
from datetime import datetime

col_name = ['title', 'timestamp', 'descrip']
rows = []  # one dict per article; DataFrame.append was removed in pandas 2.0

for j in range(20):
    ## open and read the (j+1)-th search-result page
    url = 'https://oilprice.com/search/tab/articles/lng/Page-' + str(j+1) + '.html'
    with urlopen(url) as response:
        soup = BeautifulSoup(response, 'html.parser')

    ## the container that holds the articles listed on this page
    headlines = soup.find_all('div', {'id': 'search-results-articles'})[0]

    ## getting all 20 titles, timestamps, descriptions on each page
    title = headlines.find_all('a')
    timestamp = headlines.find_all('div', {'class': 'dateadded'})
    descrip = headlines.find_all('p')

    ## getting data from each article in a page
    for i in range(len(title)):
        rows.append({
            'title': title[i].text,
            'timestamp': timestamp[i].text,
            'descrip': descrip[i].text})

    ## print progress every 10 pages
    if j % 10 == 0:
        print(str(datetime.now()) + " now processing : j = " + str(j))

## build the DataFrame once, after all pages have been scraped
df_lng = pd.DataFrame(rows, columns=col_name)

# remove temp variables
del [col_name, rows, url, response, title, timestamp, descrip, i, j]

I hope this post was helpful.

Be a happy data scientist! :-)


License: Attribution, NonCommercial, NoDerivatives

A Friend for R and Python Analysis and Programming (by R Friend)
