[Python] Web Scraping! - Text and Images in a Web Page

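I have finally reached the Web Scraping section I had been curious about.

This post records, in a few simple steps, how to scrape table-of-contents titles and an image from a web page.

The content is based on Udemy's Python Bootcamp.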

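* Web Scraping: extracting the information you need from the web

* toscrape.com : a test website for practicing scraping

* Required modules:

1. requests : a module for sending HTTP requests

2. bs4: Beautiful Soup v4, a module that parses HTML and XML files so they are easy to work with.

Install each one from cmd with pip install module_name (for Beautiful Soup the package name is beautifulsoup4). The examples in this post also pass "lxml" to Beautiful Soup as the parser, so lxml needs to be installed the same way.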

* Basic building blocks of a web page

1. HTML: defines the basic structure and content of the page

2. CSS: styles the page

3. JavaScript: adds the interactive elements of the page

* Basic HTML Structure

<!DOCTYPE html>

<html>

<head>

<title> Page Title </title>

</head>

<body>

<h1> Heading 1 </h1>

<p> Paragraph </p>

</body>

</html>

As shown, the page is made up of opening and closing tags.

Tags can also be given a class or an id, which CSS selectors refer to as .class_name and #id.

For a more detailed explanation, see www.w3schools.com/html/default.asp
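To see how this structure looks from Python, here is a minimal sketch that parses the same sample page with Beautiful Soup. It uses the "html.parser" parser that ships with Python (my choice for the sketch, so nothing extra has to be installed), and the variable names are only for illustration.

```python
import bs4

# The sample page from above, stored as a string instead of being downloaded
html = """
<!DOCTYPE html>
<html>
<head>
<title> Page Title </title>
</head>
<body>
<h1> Heading 1 </h1>
<p> Paragraph </p>
</body>
</html>
"""

# "html.parser" is Python's built-in parser (an assumption of this sketch)
soup = bs4.BeautifulSoup(html, "html.parser")

# Each tag name can be reached as an attribute of the soup object;
# get_text(strip=True) drops the spaces around the text
page_title = soup.title.get_text(strip=True)
heading = soup.h1.get_text(strip=True)
paragraph = soup.p.get_text(strip=True)

# Direct child tags of <body>, in document order
body_children = [tag.name for tag in soup.body.find_all(recursive=False)]

print(page_title)
print(heading)
print(paragraph)
print(body_children)
```

Every opening tag in the string has a matching closing tag, and that pairing is what lets the parser rebuild the nesting (html > body > h1, p).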
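Since class and id are what the scraping examples rely on, here is a small sketch of selecting by tag, class, and id with soup.select. The sample HTML, the class name intro, and the id main-title are all made up for illustration.

```python
import bs4

# Hypothetical sample page; the class name and id are invented for this sketch
html = """
<html>
  <body>
    <h1 id="main-title">Heading 1</h1>
    <p class="intro">First paragraph</p>
    <p class="intro">Second paragraph</p>
    <p>Unstyled paragraph</p>
  </body>
</html>
"""

# Python's built-in parser, so nothing extra has to be installed
soup = bs4.BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and always returns a list of matching tags
by_tag = [p.text for p in soup.select("p")]              # every <p> tag
by_class = [p.text for p in soup.select(".intro")]       # class="intro"
by_id = [h.text for h in soup.select("#main-title")]     # id="main-title"

print(by_tag)
print(by_class)
print(by_id)
```

The dot and the hash are the only difference between the class and id lookups, which is why Inspect is so handy for finding the right name first.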


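* Viewing a page's HTML & CSS in Chrome

- Right-click on the page - View page source (Ctrl + U)

- Or, to narrow the source down to the part you care about, right-click and choose Inspect instead.

On the Wikipedia page used in the example, Inspect shows that the table-of-contents titles are inside the "toctext" class.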

Web Scraping examples

1. Getting the Wikipedia table-of-contents titles

reference - en.wikipedia.org/wiki/Grace_Hopper


# Import the modules
import requests
import bs4

# Download the page and parse its HTML with the lxml parser
page_source = requests.get("https://en.wikipedia.org/wiki/Grace_Hopper")
soup = bs4.BeautifulSoup(page_source.text, "lxml")

# As seen in Inspect, the TOC titles are in the class called toctext
# Loop through the matches and print only their text
for item in soup.select('.toctext'):
    print(item.text)
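The extraction step can also be tried without touching the network. The sketch below wraps it in a helper, extract_toc_titles (a hypothetical name), and runs it on a hand-written snippet that only imitates the toctext markup; the section titles in it are invented. It also shows that select returns an empty list when nothing matches, which is worth checking for if a live page's markup turns out to be different.

```python
import bs4


def extract_toc_titles(html, selector=".toctext"):
    """Return the text of every element matching the selector.

    Hypothetical helper for this sketch, not part of requests or bs4.
    """
    soup = bs4.BeautifulSoup(html, "html.parser")
    return [item.get_text(strip=True) for item in soup.select(selector)]


# Hand-written stand-in for a table of contents (titles are made up)
sample_html = """
<div id="toc">
  <ul>
    <li><span class="tocnumber">1</span> <span class="toctext">Early life</span></li>
    <li><span class="tocnumber">2</span> <span class="toctext">Career</span></li>
  </ul>
</div>
"""

titles = extract_toc_titles(sample_html)
print(titles)

# No element has the class, so the result is an empty list rather than an error
missing = extract_toc_titles("<p>no table of contents here</p>")
print(missing)
```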

2. Getting an image

* Note. Always check the copyright

Remember that images live in <img> tags in HTML.

Among the many images on the page, Inspect showed that the ones in the article body are inside the thumbinner class.

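import requests
import bs4

source = requests.get("https://en.wikipedia.org/wiki/Grace_Hopper")
soup = bs4.BeautifulSoup(source.text, "lxml")

# First <img> inside an element with the thumbinner class
first_image = soup.select('.thumbinner img')[0]

# The src here has no scheme, so prepend 'https:' to get a full URL
link_image = requests.get('https:' + first_image['src'])

# Write the raw bytes in binary mode; the with block closes the file
with open('C:/Users/Samsung/Desktop/Wiki_image.jpg', 'wb') as f:
    f.write(link_image.content)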
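Prepending 'https:' works when the src has no scheme, but it would break on a src that is already a full URL or a plain relative path. As a possible alternative (not what the course code does), the standard library's urljoin resolves all three forms against the page URL. The helper names and the example.org addresses below are hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_image_url(page_url, src):
    """Turn an <img> src into a full URL, using the page URL as the base."""
    return urljoin(page_url, src)


def filename_from_url(url, default="image.jpg"):
    """Use the last path segment as a file name, or a default if it is empty."""
    name = posixpath.basename(urlparse(url).path)
    return name or default


page_url = "https://en.wikipedia.org/wiki/Grace_Hopper"

# Made-up src values covering the three common forms
protocol_relative = resolve_image_url(page_url, "//upload.example.org/images/photo.jpg")
root_relative = resolve_image_url(page_url, "/static/logo.png")
absolute = resolve_image_url(page_url, "https://upload.example.org/a.jpg")

print(protocol_relative)
print(root_relative)
print(absolute)
print(filename_from_url(protocol_relative))
```

The resolved URL can then be passed to requests.get, and filename_from_url gives a name to save under instead of a hard-coded one.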

Ta-da! The image is now saved on the desktop.
