
[web crawling] Extracting the information you want with Python requests & Beautiful Soup

1.  ์›น ํฌ๋กค๋ง(web crawling)์„ ๋ฐฐ์šฐ๋Š” ์ด์œ 

    • ์›น ํฌ๋กค๋ง์ด๋ž€ ์›นํŽ˜์ด์ง€(๋˜๋Š” ์›น ์‚ฌ์ดํŠธ, static document) ๋‚ด์— ์žˆ๋Š” ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜๋Š” ํ–‰์œ„, ์ฆ‰ ์ธํ„ฐ๋„ท ์ฝ˜ํ…์ธ ๋ฅผ ์ƒ‰์ธํ™”ํ•˜๋Š” ๊ณผ์ •์„ ์˜๋ฏธํ•จ
    • ๋ฐ์ดํ„ฐ ๋ถ„์„์— ํ™œ์šฉํ•˜๊ณ ์‹ถ์€ ๋ฐ์ดํ„ฐ๋ฅผ ์›น ํŽ˜์ด์ง€์—์„œ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ๊ธฐ๋•Œ๋ฌธ์— ์ค‘์š”ํ•จ
    • Beautiful Soup ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” html๊ณผ xml ๋ฌธ์„œ๋ฅผ parsing[๊ฐ์ฃผ:1] ํ•  ์ˆ˜ ์žˆ๊ณ , Selenium์€ ๋™์  ํฌ๋กค๋ง์„ ํšจ๊ณผ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์ด๋‹ค. ๋‘ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ์ฐจ์ด๊ฐ€ ๊ถ๊ธˆํ•˜๋‹ค๋ฉด ๋‹ค์Œ ๋งํฌ๋ฅผ ํด๋ฆญํ•˜์—ฌ ์ฐธ๊ณ 
    •  HTML์˜ ๊ธฐ๋ณธ์ ์ธ ์ดํ•ด๊ฐ€ ์žˆ์–ด์•ผํ•จ

2.  Using requests & Beautiful Soup

  1. requests : fetches the HTML document of the web page you want.
  2. Beautiful Soup : picks out and extracts the information you want from that HTML document.

 

[ํ„ฐ๋ฏธ๋„์— ์ž…๋ ฅํ•˜์—ฌ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์„ค์น˜ํ•˜๋Š” ์ฝ”๋“œ]

pip install requests beautifulsoup4

 

[ํŒŒ์ด์ฌ ์ฝ”๋“œ]

# ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
import requests
from bs4 import BeautifulSoup

# ์›ํ•˜๋Š” ์‚ฌ์ดํŠธ์— ์ ‘์† : requests.get
url = "http://www.google.com"
response = requests.get(url)
print(response)  # <Response [200]>
print(response.text)  # HTML ๋ฌธ์„œ ์ถœ๋ ฅ


'''์ž˜๋ชป๋œ ์š”์ฒญ์— ๋Œ€ํ•œ ์˜ˆ์™ธ ์ฒ˜๋ฆฌ'''
# case 1
if response.status_code == 200:
    # ์‘๋‹ต๋ฐ›์€ text๋ฅผ HTML ํŒŒ์‹ฑ
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    # print(soup)
    print(soup.prettify())  # HTML ๋“ค์—ฌ์“ฐ๊ธฐํ•˜์—ฌ ๊น”๋”ํ•˜๊ฒŒ ๋ณด์ด๊ฒŒ ํ•ด์ฃผ๋Š” ํ•จ์ˆ˜
else:
    print('์ž˜๋ชป๋œ ์š”์ฒญ์ž…๋‹ˆ๋‹ค. ๋‹ค์‹œ ํ™•์ธ ๋ฐ”๋žŒ!')
    
    
# case 2
# ์œ„์˜ if ์ œ์–ด๋ฌธ๊ณผ ๊ฐ™์€ ํšจ๊ณผ
response.raise_for_status()  # 200๋ฒˆ๋Œ€ ์ฝ”๋“œ๊ฐ€ ์•„๋‹ˆ๋ฉด ์ฝ”๋“œ๊ฐ€ ๋ฉˆ์ถค
print("๋ฌธ์ œ๊ฐ€ ์—†์œผ๋ฉด ์ด ๋ถ€๋ถ„ ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค.", len(response.text))
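A hedged sketch building on case 2: wrapping raise_for_status() in try/except lets the script handle a bad response instead of stopping (the helper name fetch_html is made up for this example):

```python
import requests

def fetch_html(url):
    """Return the page HTML, or None when the request fails.
    fetch_html is a made-up helper name for this sketch."""
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()  # raises HTTPError for non-2xx codes
        return response.text
    except requests.exceptions.RequestException as err:
        # covers HTTPError as well as connection and timeout errors
        print("Request failed:", err)
        return None

html = fetch_html("http://www.google.com")
```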


'''ํŒŒ์ผ ์ƒ์„ฑ ๋ฐ ์ €์žฅํ•˜๊ธฐ'''
# with open('ํŒŒ์ผ๋ช…', '์˜ต์…˜', ์ธ์ฝ”๋”ฉ) as f
# ๊ฒฝ๋กœ๋ฅผ ์ง€์ •ํ•˜์ง€์•Š์œผ๋ฉด ํ•ด๋‹น ๋””๋ ‰ํ† ๋ฆฌ์— ์ƒ์„ฑ
with open('mypage.html', 'w', encoding ='utf-8') as f:
    f.write(response.text)

 

3. Beautiful Soup functions

  • find(name, attrs, recursive, string, **kwargs) : returns the first tag matching the condition (if several match, only the first) ➡️ returns a single node (1)
  • find_all(name, attrs, recursive, string, limit, **kwargs) : returns every tag matching the condition ➡️ returns a list (several)
# example html
html = """
<html>
    <body>
        <div id="wrap">
            <ul>
                <li><a href="https://www.naver.com/">NAVER</a></li>
                <li><a href="https://www.daum.net/">DAUM</a></li>
                <li><a href="https://www.google.com/" id="final">GOOGLE</a></li>
            </ul>
        </div>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())  # ํ™•์ธ

# html ์ฝ”๋“œ ์•ˆ์—์„œ aํƒœ๊ทธ๋ฅผ ์ฐพ์•„์„œ ๋ฐ˜ํ™˜
print("find : ", soup.find('a'), type(soup.find('a')))  # 1๊ฐœ
print("find_all : ", soup.find_all('a'), type(soup.find_all('a')))  # 3๊ฐœ

'''
out:
find :  <a href="https://www.naver.com/">NAVER</a> <class 'bs4.element.Tag'>
find_all :  [<a href="https://www.naver.com/">NAVER</a>, <a href="https://www.daum.net/">DAUM</a>, <a href="https://www.google.com/">GOOGLE</a>] <class 'bs4.element.ResultSet'>
'''

 

  • get_text(), text : extract only the text
# with find
print(soup.find('a').get_text())
print(soup.find('a').text)

'''
out:
NAVER
NAVER
'''


# with find_all
# ์—ฌ๋Ÿฌ๊ฐœ์˜ ๊ฐ’์ด ์žˆ์„ ๋•Œ : for ๋ฐ˜๋ณต๋ฌธ๊ณผ ํ•จ๊ป˜ ํ™œ์šฉ
names = soup.find_all('a')
for name in names:
    print(name.text)
    print(name.get_text())
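get_text() also accepts optional separator and strip arguments, which help when one tag wraps several pieces of text; a small sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

html = "<ul><li>  NAVER </li><li> DAUM  </li></ul>"
soup = BeautifulSoup(html, "html.parser")

# separator joins the text pieces, strip=True trims the whitespace
print(soup.get_text(separator=', ', strip=True))  # NAVER, DAUM
```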
    
 
# another example
# get and print the sites whose name starts with GOO
'''string.startswith() : checks whether a string starts with the given prefix'''
names = soup.find_all('a')
for name in names:
    if name.text.startswith('GOO'):
        print(name.text)
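The limit and string parameters from the find_all signature above can be sketched on the same sample HTML (repeated here so the snippet runs on its own):

```python
from bs4 import BeautifulSoup

html = """
<ul>
    <li><a href="https://www.naver.com/">NAVER</a></li>
    <li><a href="https://www.daum.net/">DAUM</a></li>
    <li><a href="https://www.google.com/">GOOGLE</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# limit : stop after the first n matches
print(soup.find_all('a', limit=2))     # first two <a> tags only

# string : match on the tag's text instead of the tag name
print(soup.find_all(string='GOOGLE'))  # ['GOOGLE']
```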

 

  • find('tag', attrs={'id (or class)': 'value'}) : how to search by a tag's attribute
# searching by a tag's attribute
name = soup.find('a', attrs={'id': 'final'})
name

# out : <a href="https://www.google.com/" id="final">GOOGLE</a>
  • select_one, select with Copy selector : on the target site, right-click ➡️ Inspect ➡️ Copy selector on the HTML of the part you want ➡️ select_one("paste it here")
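select_one and select also accept hand-written CSS selectors, not just copied ones; a minimal sketch using the example HTML from section 3:

```python
from bs4 import BeautifulSoup

html = """
<div id="wrap">
    <ul>
        <li><a href="https://www.naver.com/">NAVER</a></li>
        <li><a href="https://www.daum.net/">DAUM</a></li>
        <li><a href="https://www.google.com/">GOOGLE</a></li>
    </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one : first match only (like find)
print(soup.select_one("#wrap ul li a").text)        # NAVER

# select : every match, as a list (like find_all)
print([a.text for a in soup.select("#wrap li a")])  # ['NAVER', 'DAUM', 'GOOGLE']

# pseudo-classes such as :nth-child() work too
print(soup.select_one("li:nth-child(2) > a").text)  # DAUM
```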

 

[example 1]

์ˆœ์„œ์— ๋งž์ถ”์–ด ๋ˆŒ๋Ÿฌ์„œ copy!

## example
url = "https://kin.naver.com/search/list.naver?query=ํŒŒ์ด์ฌ"

response = requests.get(url)

if response.status_code == 200:
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("#s_content > div.section > ul > li:nth-child(2) > dl > dt > a")
    print(title.find('b').text)

else:
    print(response.status_code)
    print("Please try again!")
    
# out: ํŒŒ์ด์ฌ

# -------------------------------------------------------------
# Inspect -> select the tag you want -> right-click -> Copy -> Copy selector
#s_content > div.section > ul > li:nth-child(2) > dl > dt > a
title = soup.select_one("#s_content > div.section > ul > li:nth-child(2) > dl > dt > a")
print(title.text)

# out : ํŒŒ์ด์ฌ ์•Œ๋ ค์ฃผ์‹ค ๋ถ„์žˆ๋‚˜์š”?


# -------------------------------------------------------------
## print only the titles
# case 1
titles = soup.find_all('a', attrs={'class': '_nclicks:kin.txt _searchListTitleAnchor'})

for title in titles:
    print(title.text)
    
    
# case 2
title = soup.find_all('dt')
result = []
for tl in title:
    result.append(tl.find('a').text)
print(*result, sep="\n")


# case 3
titles = soup.select("li>dl>dt>a")
for title in titles:
    print(title.text)
   
   
'''
out:
ํŒŒ์ด์ฌ ์ €์žฅ
ํŒŒ์ด์ฌ ์•Œ๋ ค์ฃผ์‹ค ๋ถ„์žˆ๋‚˜์š”?
ํŒŒ์ด์ฌ ์งˆ๋ฌธ
ํŒŒ์ด์ฌ๋จธ์‹ ๋Ÿฌ๋‹ ๊ณต๋ถ€๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
ํŒŒ์ด์ฌ ์…€๋ ˆ๋‹ˆ์›€
ํŒŒ์ด์ฌ์ด๋‚˜ C์–ธ์–ด ๋ฐฐ์šฐ๋Š” ๋ฐฉ๋ฒ•
ํŒŒ์ด์ฌ์—์„œ๋Š” i = 3 ์ด๋Ÿฐ ๊ฒŒ ์—๋Ÿฌ๊ฐ€ ๋‚˜๋‚˜์š”?
ํŒŒ์ด์ฌ csvํŒŒ์ผ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
ํŒŒ์ด์ฌ์€ ์ž๋ฃŒํ˜• ์–ด๋–ป๊ฒŒ ํ‘œํ˜„ํ•˜๋‚˜์š”?
ํŒŒ์ด์ฌ ์ค„๋ฐ”๊ฟˆ
'''

 

[example 2]

  • Alexa (Alexa Internet, Inc.), an Amazon subsidiary : a site that provides website rankings worldwide and by country
  • Extract the names of the top 5 ranked sites from it

As before, right-click on the web page -> Inspect

import requests
from bs4 import BeautifulSoup

url = "https://www.alexa.com/topsites/countries/KR"

rank = requests.get(url).text
soup = BeautifulSoup(rank, 'html.parser')

# select by tag
ranks = soup.select('p a')
ranks

# only the names of the top 5 sites
for i in range(1, 6):
    print(f"{i}. {ranks[i].text}")
    
'''
out:
1. Google.com
2. Naver.com
3. Youtube.com
4. Tistory.com
5. Daum.net
'''
    

## convert to a DataFrame
import pandas as pd

# keep only the text of each tag (the Tag objects themselves are not needed)
rank_dic = {'Website': [r.text for r in ranks]}
df = pd.DataFrame(rank_dic, columns=["Website"])
df.drop(0, inplace=True)  # the first match is not a site name
df.head(5)
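Since the Alexa page may no longer respond, the same DataFrame conversion can be sketched offline; site_names here is made-up stand-in data for the scraped link texts:

```python
import pandas as pd

# stand-in for the scraped link texts; index 0 is not a site name
site_names = ['Top Sites', 'Google.com', 'Naver.com',
              'Youtube.com', 'Tistory.com', 'Daum.net']

df = pd.DataFrame({'Website': site_names})
df.drop(0, inplace=True)  # drop the non-site first row
print(df.head(5))
```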

๐Ÿ“ ํ•จ์ˆ˜ ์ด ์ •๋ฆฌ

๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๊ตฌ๋ถ„ ํ•จ์ˆ˜ ๊ธฐ๋Šฅ
requests response = requests.get({url}) ์›ํ•˜๋Š” ์›น ์‚ฌ์ดํŠธ(url)์— ์ ‘์†
(๋ณดํ†ต response ๋ณ€์ˆ˜์— ์ •์˜ํ•จ)
response.status_code ์‘๋‹ต ์ฝ”๋“œ(Response) ๋ฐ˜ํ™˜
response.text  HTML ๋ฌธ์„œ ๋ฆฌํ„ด
response.raise_for_status() 200๋ฒˆ๋Œ€ ์‘๋‹ต์ฝ”๋“œ๊ฐ€ ์•„๋‹ˆ๋ฉด ์˜ˆ์™ธ๋ฅผ ๋ฐœ์ƒ์‹œํ‚ด
(2XX : ์„ฑ๊ณต, 200 : ์„œ๋ฒ„๊ฐ€ ์š”์ฒญ์„ ์ œ๋Œ€๋กœ ์ฒ˜๋ฆฌํ–ˆ๋‹ค๋Š” ๋œป)
Beautiful Soup
soup = BeautifulSoup(html, 'html.parser') ์‘๋‹ต๋ฐ›์€ HTML ํŒŒ์‹ฑ
(๋ณดํ†ต soup ๋ณ€์ˆ˜์— ์ •์˜ํ•จ)
soup.prettify()  
soup.find(name, attrs, recursive, string, **kwargs) ํ•ด๋‹น ์กฐ๊ฑด์— ๋งž๋Š” ํ•˜๋‚˜์˜ ํƒœ๊ทธ ๊ฐ€์ ธ์˜ด
(์ค‘๋ณต์ด๋ฉด ๊ฐ€์žฅ ์ฒซ๋ฒˆ์งธ ํƒœ๊ทธ)
โžก๏ธ ๋‹จ์ผ๋…ธ๋“œ ๋ฐ˜ํ™˜(1๊ฐœ)
soup.find_all(name, attrs, recursive, string, limit, **kwargs) ํ•ด๋‹น ์กฐ๊ฑด์— ๋งž๋Š” ๋ชจ๋“  ํƒœ๊ทธ ๊ฐ€์ ธ์˜ด
โžก๏ธ list ๋ฐ˜ํ™˜(๋ณต์ˆ˜๊ฐœ)
soup.find('ํƒœ๊ทธ', attrs = { id (class) :'์†์„ฑ'}) ํƒœ๊ทธ์˜ ์†์„ฑ์„ ์ฐพ๋Š” ๋ฐฉ๋ฒ•
soup.get_text(), soup.text ํ…์ŠคํŠธ๋งŒ ์ถ”์ถœ
soup.select_one, soup.select copy selector์„ ์‚ฌ์šฉํ•˜์—ฌ ์ฐพ๋Š” ํ•จ์ˆ˜

📎 Reference

  1. parsing : the task of breaking HTML code down into its elements and processing them