I scraped Venom reviews a while back, and since then a lot of people have asked me for the source code. My original code was lost, so I rewrote it and am sharing it here; I hope it helps 😀
We will scrape the Douban comments page for the movie 惡人轉: https://movie.douban.com/subject/30211551/comments?start=0&limit=20&sort=new_score&status=P
1. Analyzing the page
You can see from the URL that start=0&limit=20 marks the first page, with 20 comments per page.
start=20 gives the second page, and so on, so we can construct the URL for any page.
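As a quick check of that pagination scheme, here is a minimal sketch (mine, not from the original post) that prints the URLs for the first three pages:
base = 'https://movie.douban.com/subject/30211551/comments?start={}&limit=20&sort=new_score&status=P'
for start in range(0, 60, 20):  # first three pages: start = 0, 20, 40
    print(base.format(start))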
2. Saving all the pages
Here is the download-and-save code; read through it yourselves:
import requests


def gethtml(url):
    """Download one comments page and return its HTML as text."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'}
    resp = requests.get(url, headers=headers, timeout=3)
    resp.encoding = 'utf-8'
    return resp.text


def savehtml(file_name, html):
    """Write the page to <file_name>.html for offline parsing."""
    with open(file_name.replace('/', '_') + ".html", "wb") as f:
        # The file is opened in binary mode, so encode the str to bytes.
        f.write(html.encode('utf-8'))


# Fetch the first 10 pages: 200 comments, 20 per page.
start = 0
while start < 200:
    y = str(start)
    url = ('https://movie.douban.com/subject/30211551/comments?start=' + y
           + '&limit=20&sort=new_score&status=P')
    savehtml('data' + y, gethtml(url))
    start = start + 20
The saved files appear as above. We store the pages locally precisely so that repeated parsing runs do not hammer the site and get us banned.
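If you want to be even gentler on the site, you can pause between requests. Here is a throttled version of the download loop (my addition, reusing gethtml and savehtml from above; the 1-3 second delay is an arbitrary choice):
import random
import time

for start in range(0, 200, 20):
    y = str(start)
    url = ('https://movie.douban.com/subject/30211551/comments?start=' + y
           + '&limit=20&sort=new_score&status=P')
    savehtml('data' + y, gethtml(url))
    time.sleep(random.uniform(1, 3))  # random 1-3 s pause between pages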
3. Processing the data
from bs4 import BeautifulSoup

allname = []        # commenter names
alltime = []        # comment timestamps
alldatapeople = []  # "useful" vote counts
alltalk = []        # comment text

num = 0
while num < 200:
    y = str(num)
    with open('data' + y + '.html', "r", encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    data = []
    for item in soup.find_all('div', class_="comment-item"):
        data.append(item.text.replace("\n", ''))
    for i in data:
        # The splits below assume each flattened comment reads roughly
        # "<votes>有用<name>看過 <time> <text>", so we cut on the literal
        # markers 有用 / 看過 and on the spaces between the fields.
        alldatapeople.append(i.split(" ")[0].split('有用')[0])
        allname.append(i.split(" ")[0].split('有用')[1].split('看過')[0])
        alltime.append(i.split(" ")[1].replace(' ', ''))
        alltalk.append(i.split(" ")[2])
    num = num + 20

print(alldatapeople)
print(allname)
print(alltime)
print(alltalk)
The code above extracts the 200 comments along with their vote counts, timestamps, and commenter names, and prints them at the end.
Data visualization and data mining on top of this data are left for you to do yourselves; I will not go into detail here.
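As one small example (a sketch of mine, assuming each entry in alltime begins with a YYYY-MM-DD date), you could count how many comments fall on each day:
from collections import Counter

date_counts = Counter(t[:10] for t in alltime)  # group by the date prefix
for date, n in sorted(date_counts.items()):
    print(date, n)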
4. Getting more data
We already have the basic data for visualization and mining, but we can still gather more, for example each commenter's place of residence. The code is below:
import csv

import requests
from bs4 import BeautifulSoup


def get_profile_urls():
    """Collect every commenter's profile URL from the saved pages."""
    alldata = []
    num = 0
    while num < 200:
        y = str(num)
        with open('data' + y + '.html', "r", encoding='utf-8') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        for item in soup.find_all('div', class_="comment-item"):
            # The first href in each comment item points at the commenter's
            # profile, e.g. https://www.douban.com/people/128265644/
            flat = str(item).replace("\n", "")
            alldata.append(flat.split('href="')[1].split('"')[0])
        num = num + 20
    return alldata


def getdata(url_list):
    """Visit each profile and record its 常居 (residence) field."""
    alldata = []
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'}
    for url in url_list:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.encoding = 'utf-8'
        soup = BeautifulSoup(resp.text, 'html.parser')
        for info in soup.find_all('div', class_="user-info"):
            try:
                place = info.text.split("常居: ")[1].split("\n")[0]
                print(place)
                alldata.append(place)
            except IndexError:
                alldata.append("暫無")  # profile does not list a residence
    with open("居住地.csv", 'w+', newline="", encoding='utf-8') as f:
        writer = csv.writer(f)
        for row in alldata:
            # Wrap each string in a list; a bare string would be written
            # one character per column.
            writer.writerow([row])


urls = get_profile_urls()
getdata(urls)
The results are written to 居住地.csv. Because this step hits a lot of pages, it is easy to get your IP banned, so I suggest keeping the data on your own machine, using proxy IPs, and so on. Basically all the code is above. As for data analysis and how to plot charts, look it up yourselves; once you have the data it should be straightforward. You can still contact me if you get stuck: QQ: 626529441.
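For reference, sending requests through a proxy looks roughly like this (a sketch of mine; 127.0.0.1:8080 is a placeholder you must replace with a proxy that actually works):
import requests

proxies = {
    'http': 'http://127.0.0.1:8080',   # placeholder proxy address
    'https': 'http://127.0.0.1:8080',  # placeholder proxy address
}
resp = requests.get('https://movie.douban.com/subject/30211551/',
                    headers={'User-Agent': 'Mozilla/5.0'},
                    proxies=proxies, timeout=10)
print(resp.status_code)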