1. Using parse.urljoin(base, url)
from urllib import parse
Request(url=parse.urljoin(response.url, url), callback=self.parse_detail)
This joins the relative url (e.g. /111954/) onto the scheme and domain extracted from response.url. If url already carries its own domain, urljoin keeps that domain and ignores the one from response.url.
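The behaviour is easy to check directly (the URLs below are made-up examples):

```python
from urllib import parse

# Relative path: scheme and host are taken from the base (i.e. response.url)
assert parse.urljoin("http://blog.jobbole.com/all-posts/", "/111954/") \
    == "http://blog.jobbole.com/111954/"
# Absolute url: the base's domain is ignored entirely
assert parse.urljoin("http://blog.jobbole.com/all-posts/", "http://example.com/1/") \
    == "http://example.com/1/"
```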
2. Attaching data to a Request via meta
from scrapy.http import Request
yield Request(url=parse.urljoin(response.url, article_url),
              meta={"image_url": image_url},
              callback=self.parse_detail)
Extracting the data in the callback:
front_image_url = response.meta.get("image_url", "")
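Scrapy copies the Request's meta dict onto the response, so the callback reads it like an ordinary dict; using .get with a default avoids a KeyError when a key was never attached (values below are made up):

```python
# response.meta behaves like a plain dict (hypothetical value for illustration)
meta = {"image_url": "http://example.com/cover.jpg"}
front_image_url = meta.get("image_url", "")
assert front_image_url == "http://example.com/cover.jpg"
assert meta.get("not_set", "") == ""   # a missing key falls back to the default
```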
3. Automatic image downloading
Basic configuration in settings.py:
ITEM_PIPELINES = {
'ArtSpider.pipelines.ArtspiderPipeline': 300,
'scrapy.pipelines.images.ImagesPipeline': 1,
}
# tell scrapy which item field the image URLs come from
IMAGES_URLS_FIELD = "front_image_url"
# set the image download directory (needs `import os` at the top of settings.py)
import os
project_dir = os.path.dirname(os.path.abspath(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')
Important: when assigning the image URL field (front_image_url) on the item, the value must be a list.
article_item['front_image_url'] = [front_image_url]
This is because scrapy.pipelines.images.ImagesPipeline iterates over the value:
def get_media_requests(self, item, info):
    return [Request(x) for x in item.get(self.images_urls_field, [])]
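This iteration is exactly why a bare string would break: iterating a string yields single characters, so the pipeline would try to fetch "h", "t", "t", "p", ... as URLs. A quick check (example URL is made up):

```python
url = "http://example.com/a.jpg"

# Wrapped in a list: one request per URL, as intended
assert [x for x in [url]] == ["http://example.com/a.jpg"]
# A plain string: one "request" per character
assert [x for x in url][:4] == ["h", "t", "t", "p"]
```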
A custom pipeline can capture each image's storage path so it can be linked back to the corresponding item:
class ArticleImagePipeline(ImagesPipeline):
    """
    Custom pipeline: subclass scrapy.pipelines.images.ImagesPipeline and override item_completed
    """
    def item_completed(self, results, item, info):
        for ok, value in results:
            if ok:  # on a failed download, value is a Failure with no 'path' key
                item['front_image_path'] = value['path']
        # return the item so later pipelines can keep processing it
        return item
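Per the scrapy media-pipeline docs, results is a list of (success, value) 2-tuples; on success value is a dict with url, path, and checksum keys. The loop above can be exercised against that shape (the values below are made up):

```python
# Hypothetical results payload mimicking what ImagesPipeline passes in
results = [(True, {"url": "http://example.com/a.jpg",
                   "path": "full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg",
                   "checksum": "2b00042f7481c7b056c4b410d28f33cf"})]
item = {}
for ok, value in results:
    if ok:
        item["front_image_path"] = value["path"]
assert item["front_image_path"].startswith("full/")
```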
settings.py:
ITEM_PIPELINES = {
'ArtSpider.pipelines.ArtspiderPipeline': 300,
# 'scrapy.pipelines.images.ImagesPipeline': 1,
'ArtSpider.pipelines.ArticleImagePipeline': 1,
}
4. Upserting data (MySQL-specific; does not apply to other databases)
insert_sql = """
    INSERT INTO jobbole_article(title, url, create_date, fav_nums, url_object_id,
        front_image_url, front_image_path, praise_nums, comment_nums, tags, content)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    ON DUPLICATE KEY UPDATE fav_nums=VALUES(fav_nums), comment_nums=VALUES(comment_nums);
"""
On insert, if the primary key already exists the row is considered present and gets updated; if not, a new row is inserted.
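ON DUPLICATE KEY UPDATE is MySQL-only syntax, but the insert-or-update behaviour itself can be illustrated with sqlite3's analogous ON CONFLICT clause (SQLite 3.24+; the table and values are simplified stand-ins):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobbole_article "
             "(url_object_id TEXT PRIMARY KEY, title TEXT, fav_nums INTEGER)")
upsert = """
    INSERT INTO jobbole_article (url_object_id, title, fav_nums)
    VALUES (?, ?, ?)
    ON CONFLICT(url_object_id) DO UPDATE SET fav_nums = excluded.fav_nums
"""
conn.execute(upsert, ("abc123", "First crawl", 10))
conn.execute(upsert, ("abc123", "Second crawl", 42))  # key exists -> update
row = conn.execute("SELECT title, fav_nums FROM jobbole_article").fetchone()
assert row == ("First crawl", 42)   # fav_nums updated, other columns untouched
```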
5. A handy crawler utility (module: copyheaders — paste raw request headers as-is, no need to quote every line by hand)
pip install copyheaders
from copyheaders import headers_raw_to_dict
post_headers_raw = b"""
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,zh-TW;q=0.2
Connection:keep-alive
Host:www.zhihu.com
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
Referer:http://www.zhihu.com/
"""
headers = headers_raw_to_dict(post_headers_raw)
Project page: https://github.com/jin10086/copyheaders
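The idea behind the conversion is simple; a rough sketch of what the library does (illustration only — the real library's exact output format may differ):

```python
def headers_raw_to_dict_sketch(raw: bytes) -> dict:
    """Crude stand-in for copyheaders.headers_raw_to_dict: one 'Name:value' per line."""
    headers = {}
    for line in raw.decode("utf-8").splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        name, _, value = line.partition(":")  # split on the FIRST colon only
        headers[name.strip()] = value.strip()
    return headers

raw = b"""
Host:www.zhihu.com
Connection:keep-alive
"""
assert headers_raw_to_dict_sketch(raw) == {"Host": "www.zhihu.com",
                                           "Connection": "keep-alive"}
```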
6. Removing HTML tags from extracted content
# strips tags from the extracted HTML
from w3lib.html import remove_tags
job_desc = scrapy.Field(
    input_processor=MapCompose(remove_tags)
)
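To see the effect of tag removal, here is a crude regex-based stand-in (illustration only — w3lib's remove_tags is a proper implementation with more options, so use it in real code):

```python
import re

def strip_tags(html: str) -> str:
    """Naive stand-in for w3lib.html.remove_tags: drop anything between < and >."""
    return re.sub(r"<[^>]+>", "", html)

# Hypothetical job-description markup
assert strip_tags("<p>Requirements: <b>Python</b>, Scrapy</p>") \
    == "Requirements: Python, Scrapy"
```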
7. Per-spider custom settings
In the spider class:
custom_settings = {
    "COOKIES_ENABLED": True
}
Scrapy applies these through the classmethod update_settings (from scrapy's Spider base class), which merges custom_settings at 'spider' priority so they override values from settings.py:
@classmethod
def update_settings(cls, settings):
    settings.setdict(cls.custom_settings or {}, priority='spider')
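The priority merge can be sketched without scrapy; the numeric priorities below mirror scrapy's documented ordering (default < project < spider) but the class itself is a simplified stand-in, not scrapy's Settings:

```python
# Hypothetical mini version of priority-based settings merging
PRIORITIES = {"default": 0, "project": 20, "spider": 30}

class MiniSettings:
    def __init__(self):
        self.values = {}  # name -> (value, numeric priority)

    def set(self, name, value, priority):
        p = PRIORITIES[priority]
        # only overwrite when the new priority is at least as high
        if name not in self.values or p >= self.values[name][1]:
            self.values[name] = (value, p)

    def setdict(self, d, priority):
        for name, value in d.items():
            self.set(name, value, priority)

    def get(self, name):
        return self.values[name][0]

s = MiniSettings()
s.setdict({"COOKIES_ENABLED": False}, priority="project")  # from settings.py
s.setdict({"COOKIES_ENABLED": True}, priority="spider")    # from custom_settings
assert s.get("COOKIES_ENABLED") is True  # spider priority wins
```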