A friend asked me to build a scraper for him, with each page's data exported to its own Excel file.
Target site: http://www.hs-bianma.com/hs_chapter_01.htm
From inspecting the page, the data is stored in a plain table built from <td> and <th> cells, which is about as simple as scraping gets. Python's web-handling modules are excellent, so the whole script took roughly 30 minutes start to finish.
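To show what "pulling the <th> headers and <td> cells out of a table" looks like in isolation, here is a minimal sketch against a made-up HTML snippet (the data below is hypothetical, not from the live page):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page's table (hypothetical rows).
html = """
<table>
  <tr><th>Code</th><th>Description</th></tr>
  <tr><td>0101</td><td>Live horses</td></tr>
  <tr><td>0102</td><td>Live bovine animals</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Header cells and data cells come back as two flat lists.
headers = [th.text.strip() for th in soup.find_all("th")]
cells = [td.text.strip() for td in soup.find_all("td")]

print(headers)  # ['Code', 'Description']
print(cells)    # ['0101', 'Live horses', '0102', 'Live bovine animals']
```

Note that `find_all('td')` flattens the whole table into one list, so the row structure has to be rebuilt afterwards by slicing, which is exactly what the full script below does.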
[Screenshots of the site and its page source omitted.]
Required modules: BeautifulSoup (bs4), urllib.request, xlwt (plus lxml as the parser backend)
Without further ado, the code:
from bs4 import BeautifulSoup
from urllib import request
import xlwt

# Fetch the data, one chapter page at a time
value = 1
while value <= 98:
    # Chapter numbers in the URL are zero-padded: hs_chapter_01.htm ... hs_chapter_98.htm
    value0 = str(value).zfill(2)
    url = "http://www.hs-bianma.com/hs_chapter_" + value0 + ".htm"
    # url = "http://www.hs-bianma.com/hs_chapter_01.htm"
    '''swap in a single URL here if you only want one page'''
    response = request.urlopen(url)
    html = response.read()
    html = html.decode("utf-8")
    bs = BeautifulSoup(html, 'lxml')

    # Header processing
    title = bs.find_all('th')
    data_list_title = []
    for data in title:
        data_list_title.append(data.text.strip())

    # Cell processing: the table has 16 columns, so slice the
    # flat cell list into rows of 16
    content = bs.find_all('td')
    data_list_content = []
    for data in content:
        data_list_content.append(data.text.strip())
    new_list = [data_list_content[i:i + 16]
                for i in range(0, len(data_list_content), 16)]

    # Write to an Excel workbook
    book = xlwt.Workbook()
    sheet1 = book.add_sheet('sheet1', cell_overwrite_ok=True)

    # Header row
    ii = 0
    for head in data_list_title:
        sheet1.write(0, ii, head)
        ii += 1

    # Data rows
    i = 1
    for record in new_list:
        j = 0
        for data in record:
            sheet1.write(i, j, data)
            j += 1
        i += 1

    # Save one file per chapter
    book.save('sum' + value0 + '.xls')
    print(value0 + " written!")
    value += 1
print("All done!")
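The row-splitting trick in the script (slicing the flat list of <td> texts into rows of 16) is worth seeing on its own. A minimal sketch, using a hypothetical `chunk` helper and a width of 4 instead of 16:

```python
def chunk(flat, width):
    """Split a flat list of cells into rows of the given width."""
    return [flat[i:i + width] for i in range(0, len(flat), width)]

flat = ["a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4"]
print(chunk(flat, 4))  # [['a1', 'a2', 'a3', 'a4'], ['b1', 'b2', 'b3', 'b4']]
```

If the page's cell count is not an exact multiple of the width (say a malformed row), the last chunk simply comes out short, so a sanity check on `len(flat) % width` before writing is a cheap safeguard.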