18. Rapid Development of a Distributed Search Engine in Python, Scrapy in Depth: Configuring the Scrapy Launch File and XPath Expressions

Original

天降攻城獅

2019-07-05 09:25

[Baidu Cloud Search, find all kinds of resources: http://www.bdyss.cn]

[Netdisk Search, find all kinds of resources: http://www.swpan.cn]

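We define our own main.py to serve as the launch file, so the spider can be started by running a Python script instead of typing the scrapy command in a terminal.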

main.py

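#!/usr/bin/env python
# -*- coding:utf8 -*-

from scrapy.cmdline import execute  # runs a scrapy command as if it were typed on the command line
import sys
import os

# Add the directory that contains main.py to the interpreter's module search path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

execute(['scrapy', 'crawl', 'pach', '--nolog'])  # run the scrapy command: crawl the 'pach' spider without log output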
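
The list handed to execute() mirrors the command you would otherwise type in a terminal from the project directory. The following is a minimal sketch of that correspondence using only the standard library; it does not call Scrapy itself, and the variable names command and argv are chosen here for illustration.

```python
import shlex

# The command you would otherwise type in a terminal from the project root.
command = "scrapy crawl pach --nolog"

# shlex.split breaks the string into an argv-style list of words,
# which is the same shape as the list passed to execute() in main.py.
argv = shlex.split(command)
print(argv)
```

Writing the list by hand, as main.py does, avoids any quoting questions; splitting a string is only a convenient way to see that each word of the command becomes one list element.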

Spider file

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import urllib.response
from lxml import etree
import re

class PachSpider(scrapy.Spider):
    name = 'pach'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        pass

XPath expressions
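
As a rough illustration of the most common XPath patterns, here is a small sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath, so this is a stand-in rather than the selector engine Scrapy uses. The markup and the example.com URLs are made up for the example. ElementTree has no /text() or /@href steps, so .text and .get("href") play those roles here.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a downloaded page.
page = """
<html><body>
  <div id="archive">
    <a class="archive-title" href="http://example.com/post/1">First post</a>
    <a class="archive-title" href="http://example.com/post/2">Second post</a>
    <a class="other" href="http://example.com/about">About</a>
  </div>
</body></html>
""".strip()

root = ET.fromstring(page)

# .//a  selects every <a> element anywhere below the root
all_links = root.findall(".//a")

# .//a[@class='archive-title']  keeps only <a> elements whose class attribute matches
title_links = root.findall(".//a[@class='archive-title']")

# .//div[@id='archive']/a  selects <a> elements that are direct children of that <div>
div_links = root.findall(".//div[@id='archive']/a")

titles = [a.text for a in title_links]       # plays the role of /text()
urls = [a.get("href") for a in title_links]  # plays the role of /@href

print(len(all_links), len(title_links), len(div_links))
print(titles)
print(urls)
```

In a Scrapy spider the same ideas are written as full XPath, for example //a[@class="archive-title"]/text(), and real pages do not have to be well-formed XML as they do in this sketch.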

Basic usage

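allowed_domains sets the domains the spider is allowed to crawl
start_urls sets the URLs the spider starts crawling from
parse(response) is the spider's default callback. response is the object holding the HTML the spider fetched; it wraps a number of methods and attributes for working with that HTML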

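Methods and attributes of the response object
response.url returns the URL that was crawled
response.body returns the page content as raw bytes
response.body_as_unicode() returns the page content decoded to a Unicode string (newer Scrapy versions expose the same content as response.text)
xpath() method: filters nodes with an XPath expression
extract() method: returns the filtered data as a list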
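
The difference between the raw page content and its decoded form can be sketched with plain Python values. This does not use Scrapy: body and text below are stand-ins for what response.body and the decoded content would hold, and UTF-8 is assumed as the page encoding.

```python
# Stand-in for response.body: the raw bytes as received (UTF-8 assumed here).
body = "<title>伯樂在線</title>".encode("utf-8")

# Stand-in for the decoded content: the same markup as a Python str.
text = body.decode("utf-8")

# The byte count is larger because each Chinese character takes several bytes in UTF-8.
print(type(body).__name__, len(body))
print(type(text).__name__, len(text))
```

Printing the bytes form shows escape sequences for non-ASCII characters, which is why the decoded form is usually the more readable one to inspect.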

# -*- coding: utf-8 -*-
import scrapy

class PachSpider(scrapy.Spider):
    name = 'pach'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        leir = response.xpath('//a[@class="archive-title"]/text()').extract()  # titles of the matched links
        leir2 = response.xpath('//a[@class="archive-title"]/@href').extract()  # URLs of the matched links

        print(response.url)    # the URL that was crawled
        print(response.body)   # the page content as raw bytes
        print(response.text)   # the page content as a Unicode string (older Scrapy: response.body_as_unicode())

        for i in leir:
            print(i)
        for i in leir2:
            print(i)
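
The two loops above print all titles first and then all URLs. If each title should appear next to its own URL, one option is to walk both lists together. The sketch below uses made-up values in place of the real extract() results, and assumes every matched link has both a text node and an href, so the two lists have the same length and order.

```python
# Hypothetical values standing in for the two lists returned by extract().
leir = ["First post", "Second post"]
leir2 = ["http://example.com/post/1", "http://example.com/post/2"]

# zip walks both lists in step, so each title stays next to its own URL.
pairs = list(zip(leir, leir2))
for title, url in pairs:
    print(title, url)
```

Note that zip stops at the shorter list, so a link without text would silently shift the pairing; that is the assumption to check before relying on this approach.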

[Reposted from: http://www.lqkweb.com]
