Goal:
Get the titles of all 250 films on the Douban Movie Top 250 list. The page URL is:
https://movie.douban.com/top250
Step 1:
Fetch the HTML of each page:
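import requests


def get_movies():
    # Browser-like request headers. Adjacent string literals are joined,
    # so the User-Agent value carries no stray indentation whitespace.
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/70.0.3538.25 '
                      'Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400',
        'Host': 'movie.douban.com'
    }
    # The list spans 10 pages of 25 films; `start` is the offset of the first film.
    for i in range(0, 10):
        link = 'https://movie.douban.com/top250?start=' + str(i * 25)
        r = requests.get(link, headers=headers, timeout=20)
        print("Page", str(i + 1), "response status code:", r.status_code)
        print(r.text)


get_movies()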
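The paging in the loop above can be checked without sending any requests. The following is a minimal sketch, assuming the list holds 250 films at 25 per page (as the i * 25 offset implies); the names BASE_URL, PAGE_SIZE, TOTAL and page_urls are made up for this illustration and do not appear in the original code.

```python
BASE_URL = 'https://movie.douban.com/top250'
PAGE_SIZE = 25   # films per page, as implied by the i * 25 offset (assumption)
TOTAL = 250      # size of the Top 250 list (assumption)


def page_urls():
    """Return the URL of every page of the list, in order."""
    return [f'{BASE_URL}?start={offset}'
            for offset in range(0, TOTAL, PAGE_SIZE)]


urls = page_urls()
print(len(urls))   # number of pages to request
print(urls[0])     # first page, offset 0
print(urls[-1])    # last page, offset 225
```

Building the URL list first makes it easy to confirm that the offsets run from 0 to 225 before any network traffic is generated.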
Step 2:
Parse the pages:
import requests
from bs4 import BeautifulSoup


def get_movies():
    # Browser-like request headers, same as in step 1.
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/70.0.3538.25 '
                      'Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400',
        'Host': 'movie.douban.com'
    }
    movie_list = []
    for i in range(0, 10):
        link = 'https://movie.douban.com/top250?start=' + str(i * 25)
        r = requests.get(link, headers=headers, timeout=20)
        print("Page", str(i + 1), "response status code:", r.status_code)
        # The 'lxml' parser requires the lxml package to be installed.
        soup = BeautifulSoup(r.text, 'lxml')
        # Each film's title block sits in a <div class="hd">.
        div_list = soup.find_all('div', class_='hd')
        for each in div_list:
            # The first <span> inside the link holds the film's name.
            movie = each.a.span.text.strip()
            movie_list.append(movie)
    return movie_list


movies = get_movies()
print(movies)
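To see what the each.a.span.text.strip() lookup picks out, the same extraction can be tried offline on a small fragment. This is only a sketch: SAMPLE_HTML is hand-written and merely assumed to mirror the div.hd > a > span nesting that the code above relies on, and it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages. FirstTitleParser and extract_titles are hypothetical names for this illustration.

```python
from html.parser import HTMLParser

# Hand-written sample (an assumption, not copied from the live page).
SAMPLE_HTML = """
<div class="hd">
  <a href="https://example.com/movie/1">
    <span class="title">Movie A</span>
    <span class="title"> / Alternate A</span>
  </a>
</div>
<div class="hd">
  <a href="https://example.com/movie/2">
    <span class="title">Movie B</span>
  </a>
</div>
"""


class FirstTitleParser(HTMLParser):
    """Collects the text of the first <span> inside each <div class="hd">."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.in_hd = False     # currently inside a <div class="hd">
        self.in_span = False   # currently inside the span being captured
        self.taken = False     # first span of this div already captured

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'div' and 'hd' in classes:
            self.in_hd = True
            self.taken = False
        elif tag == 'span' and self.in_hd and not self.taken:
            self.in_span = True
            self.titles.append('')

    def handle_data(self, data):
        if self.in_span:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == 'span' and self.in_span:
            self.in_span = False
            self.taken = True
            self.titles[-1] = self.titles[-1].strip()
        elif tag == 'div':
            self.in_hd = False


def extract_titles(html):
    """Return the first-span text of every div.hd block in `html`."""
    parser = FirstTitleParser()
    parser.feed(html)
    return parser.titles


print(extract_titles(SAMPLE_HTML))
```

Only the first span of each block is kept, which matches how each.a.span resolves to the first matching descendant; any further spans in the same link (the alternate title in the sample) are skipped.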