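For parsing web pages, XPath is the more intuitive approach. It makes full use of the HTML tag structure, walking down the tree level by level until it reaches the target tag.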
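- Build the etree object with tree=etree.HTML(pagetext).
- Note that xpath('expression') returns a list, not a string, so index into it to get the string you want, e.g. image_src=image.xpath('./@src')[0] , name=image.xpath('./@alt')[0]
- Note that positions inside an XPath expression count from 1, not 0 (p[3] is the third p element), whereas the list that xpath() returns is an ordinary Python list indexed from 0. imglist=tree.xpath('//div[@class="article font16"]/p[3]/img')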
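The points above can be tried out without touching the network. The sketch below is only an illustration: the HTML snippet and the example.com URLs are made up, and only the div class and the XPath expression mirror the ones used in these notes.

```python
from lxml import etree

# Hypothetical HTML standing in for a downloaded page (not the real site).
sample_html = """
<div class="article font16">
  <p>first</p>
  <p>second</p>
  <p><img src="http://example.com/a.jpg" alt="cat"/><img src="http://example.com/b.jpg" alt="dog"/></p>
</div>
"""

tree = etree.HTML(sample_html)

# Inside the expression, positions are 1-based: p[3] is the third <p>.
images = tree.xpath('//div[@class="article font16"]/p[3]/img')

# The result itself is a normal Python list, indexed from 0.
first_src = images[0].xpath('./@src')[0]
alt_texts = [img.xpath('./@alt')[0] for img in images]

# An expression that matches nothing gives an empty list,
# so indexing it with [0] would raise IndexError.
missing = tree.xpath('//div[@class="article font16"]/p[4]/img')

print(first_src)
print(alt_texts)
print(missing)
```

Because an unmatched expression returns an empty list rather than None, it is usually worth checking the list before taking [0] from it.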
```python
from lxml import etree
import os
import requests

if __name__ == '__main__':
    # Create the output directory on first run
    if not os.path.exists("./images1"):
        os.mkdir("./images1")
    url = "http://www.heiguang.com/photography/pandp/20160105/63228.html"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"}
    pagetext = requests.get(url=url, headers=headers).text  # fetch the text of the whole page
    tree = etree.HTML(pagetext)
    # p[3] is the third <p> inside the div (XPath positions start at 1)
    imglist = tree.xpath('//div[@class="article font16"]/p[3]/img')
    for image in imglist:
        image_src = image.xpath('./@src')[0]
        # .content gives the raw bytes of the image
        image_content = requests.get(url=image_src, headers=headers).content
        print(image_src)
        name = image.xpath('./@alt')[0] + image_src.split('/')[-1]
        # The alt text comes back mis-decoded (UTF-8 bytes read as ISO-8859-1);
        # re-encode and decode it to recover the original characters
        name = name.encode('iso-8859-1').decode('utf-8')
        image_path = "./images1/" + name
        print(image_path)
        with open(image_path, "wb") as fp:
            fp.write(image_content)
    print("end")
```
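The name.encode('iso-8859-1').decode('utf-8') step in the script is easy to misread, so here is a small standalone sketch of what it undoes. It assumes the garbling came from UTF-8 bytes being decoded as ISO-8859-1, which can happen when the response does not declare a charset. The helper repair_mojibake is a hypothetical name introduced for this illustration; it is not part of requests or lxml.

```python
def repair_mojibake(text):
    """Undo a UTF-8 -> ISO-8859-1 mis-decoding; return text unchanged if it does not apply."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not garbled in this particular way, so leave it alone.
        return text


original = "攝影"

# Simulate the problem: the UTF-8 bytes are decoded with the wrong codec.
garbled = original.encode("utf-8").decode("iso-8859-1")

# Reverse it: get the original bytes back, then decode them correctly.
repaired = repair_mojibake(garbled)

print(repr(garbled))
print(repaired)

# Text that was decoded correctly in the first place passes through untouched.
print(repair_mojibake(original))
```

The try/except matters because the bare round-trip raises UnicodeEncodeError on any name that already contains correctly decoded non-Latin-1 characters.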