Notes on using the Scrapy framework

1. Using parse.urljoin(base, url)

from urllib import parse
from scrapy.http import Request

Request(url=parse.urljoin(response.url, url), callback=self.parse_detail)

This joins a relative url (e.g. /111954/) onto the domain extracted from response.url. If the url already contains a domain, the domain from response.url is not used in the join.
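For example (results shown as comments; the URLs are illustrative):

from urllib import parse

# Relative path: joined onto the scheme and domain of the base URL
parse.urljoin("http://blog.jobbole.com/all-posts/", "/111954/")
# -> 'http://blog.jobbole.com/111954/'

# Absolute URL: returned as-is, the base URL's domain is ignored
parse.urljoin("http://blog.jobbole.com/all-posts/", "http://example.com/1/")
# -> 'http://example.com/1/'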

2. Binding data to a Request with meta

from scrapy.http import Request
yield Request(url=parse.urljoin(response.url, article_url), 
              meta={"image_url": image_url},
              callback=self.parse_detail)

Extracting the data in the callback:

front_image_url = response.meta.get("image_url","")
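Putting the two snippets together, a minimal sketch of the round trip. The spider name, start URL and CSS selectors are assumptions for illustration; the meta key and callback follow the snippets above.

from urllib import parse

import scrapy
from scrapy.http import Request


class JobboleSpider(scrapy.Spider):
    # Spider name, start URL and the CSS selectors are assumptions for illustration.
    name = "jobbole"
    start_urls = ["http://blog.jobbole.com/all-posts/"]

    def parse(self, response):
        for post in response.css(".post"):
            image_url = post.css("img::attr(src)").extract_first("")
            article_url = post.css("a::attr(href)").extract_first("")
            # Bind the list-page cover image URL to the detail-page Request
            yield Request(url=parse.urljoin(response.url, article_url),
                          meta={"image_url": image_url},
                          callback=self.parse_detail)

    def parse_detail(self, response):
        # Read back the value bound in parse()
        front_image_url = response.meta.get("image_url", "")
        self.logger.info("cover image url: %s", front_image_url)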

3. Automatic image downloading

Basic configuration in settings.py:

import os

ITEM_PIPELINES = {
    'ArtSpider.pipelines.ArtspiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Tell Scrapy which item field holds the image URLs
IMAGES_URLS_FIELD = "front_image_url"

# Download directory for the images
project_dir = os.path.dirname(os.path.abspath(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')

Important: when assigning the image URL field (front_image_url) on the item, the value must be a list.

article_item['front_image_url'] = [front_image_url]

This is because scrapy.pipelines.images.ImagesPipeline iterates over the value:

    def get_media_requests(self, item, info):
        return [Request(x) for x in item.get(self.images_urls_field, [])]

A custom pipeline that records the stored image path and filename so they can be attached to the corresponding item:

from scrapy.pipelines.images import ImagesPipeline


class ArticleImagePipeline(ImagesPipeline):
    """
    Custom pipeline that inherits from scrapy.pipelines.images.ImagesPipeline
    and overrides item_completed.
    """

    def item_completed(self, results, item, info):
        # results is a list of (ok, file_info) tuples; keep the stored path
        for ok, value in results:
            front_image_path = value['path']
        item['front_image_path'] = front_image_path
        # Return the item so later pipelines can keep processing it
        return item
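For reference, each entry in results is an (ok, file_info) tuple whose file_info dict carries the download URL, the stored path relative to IMAGES_STORE, and a checksum. The values below are purely illustrative:

results = [
    (True, {
        'url': 'http://blog.jobbole.com/img/cover.jpg',    # download URL (illustrative)
        'path': 'full/0a79c461d.jpg',                       # path relative to IMAGES_STORE
        'checksum': 'd41d8cd98f00b204e9800998ecf8427e',     # md5 of the image (illustrative)
    }),
]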

settings.py configuration:

ITEM_PIPELINES = {
    'ArtSpider.pipelines.ArtspiderPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'ArtSpider.pipelines.ArticleImagePipeline': 1,
}

4. Upserting data (MySQL-specific syntax; not applicable to other databases)

insert_sql = """
    insert into jobbole_article(title, url, create_date, fav_nums, url_object_id,
        front_image_url, front_image_path, praise_nums, comment_nums, tags, content)
    values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    on duplicate key update fav_nums=values(fav_nums), comment_nums=values(comment_nums);
"""

On insert, if the primary key (or a unique key) already exists, the existing row is updated; otherwise a new row is inserted.
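A minimal pipeline sketch that runs such a statement with pymysql, using only a subset of the columns above. The class name, connection parameters and item fields are assumptions; a real project might instead use twisted.enterprise.adbapi for asynchronous inserts.

import pymysql


class MysqlUpsertPipeline(object):
    # Class name, connection parameters and item fields are assumptions for illustration.

    def open_spider(self, spider):
        self.conn = pymysql.connect(host="localhost", user="root", password="root",
                                    database="article_spider", charset="utf8")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """
            insert into jobbole_article(title, url, fav_nums, comment_nums)
            values (%s, %s, %s, %s)
            on duplicate key update fav_nums=values(fav_nums), comment_nums=values(comment_nums);
        """
        self.cursor.execute(insert_sql, (item["title"], item["url"],
                                         item["fav_nums"], item["comment_nums"]))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()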

5. A handy crawler utility: the copyheaders module converts raw copied headers into a dict, so you don't have to quote each line by hand

pip install copyheaders

from copyheaders import headers_raw_to_dict
post_headers_raw = b"""
    Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
    Accept-Encoding:gzip, deflate, sdch
    Accept-Language:en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,zh-TW;q=0.2
    Connection:keep-alive
    Host:www.zhihu.com
    User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
    Referer:http://www.zhihu.com/
"""
headers = headers_raw_to_dict(post_headers_raw)

Project page: https://github.com/jin10086/copyheaders
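A minimal usage sketch passing the converted dict straight to a Scrapy request; the spider name and URL are illustrative:

import scrapy
from copyheaders import headers_raw_to_dict


class ZhihuSpider(scrapy.Spider):
    # Spider name and URL are illustrative.
    name = "zhihu"

    def start_requests(self):
        headers = headers_raw_to_dict(b"""
            Host:www.zhihu.com
            User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
        """)
        yield scrapy.Request("https://www.zhihu.com/", headers=headers,
                             callback=self.parse)

    def parse(self, response):
        pass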

6. Removing HTML tags from extracted content

import scrapy
from scrapy.loader.processors import MapCompose
# remove_tags strips HTML tags from extracted markup
from w3lib.html import remove_tags

job_desc = scrapy.Field(
    input_processor=MapCompose(remove_tags)
)
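The input processor is applied when the field is populated through an ItemLoader. A minimal sketch, where the item class name, spider details and CSS selector are assumptions:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags


class JobItem(scrapy.Item):
    # Field from the snippet above; TakeFirst keeps a single string instead of a list
    job_desc = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=TakeFirst(),
    )


class JobSpider(scrapy.Spider):
    name = "job"
    start_urls = ["https://example.com/jobs/1"]   # placeholder URL

    def parse(self, response):
        loader = ItemLoader(item=JobItem(), response=response)
        # ".job_bt" is an assumed selector; the raw HTML it matches is run
        # through remove_tags before being stored on the item
        loader.add_css("job_desc", ".job_bt")
        yield loader.load_item()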

7. Spider-level custom settings

In the spider class:

custom_settings = {
    "COOKIES_ENABLED": True,
}

Scrapy applies these values through the spider's update_settings classmethod, merging them at 'spider' priority so they override the project-wide settings.py:

@classmethod
def update_settings(cls, settings):
    settings.setdict(cls.custom_settings or {}, priority='spider')
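For completeness, a minimal spider sketch carrying the setting (the spider name is illustrative):

import scrapy


class ZhihuSpider(scrapy.Spider):
    # Spider name is illustrative; any spider works the same way.
    name = "zhihu"
    # Merged at 'spider' priority, overriding the project-wide settings.py
    custom_settings = {
        "COOKIES_ENABLED": True,
    }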
