Scraping company information from Jobui (職友集) with Scrapy

Requirements analysis

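Start URL to crawl: http://www.jobui.com/cmp
The data we want lives on each company's detail page.
First we need the full list of companies: the program pages through the list automatically, picking up the next-page link and the detail-page URL of every company.
It then sends a request to each detail-page URL and extracts the target data from the detail page.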

Writing the code

First, create the Scrapy project from the command line:
```
scrapy startproject ZhiYouJi
```

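Next, write the project's items file to define the fields to save. I picked the detail page of a company whose information is fairly complete; the fields to save are the ones boxed in red in the screenshot. Edit items.py accordingly: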

import scrapy

class ZhiyoujiItem(scrapy.Item):
  # Company name
  name = scrapy.Field()
  # View count
  views = scrapy.Field()
  # Company type
  type = scrapy.Field()
  # Company size
  size = scrapy.Field()
  # Industry
  industry = scrapy.Field()
  # Short name
  abbreviation = scrapy.Field()
  # Company description
  info = scrapy.Field()
  # Approval rating
  praise = scrapy.Field()
  # Salary range
  salary_range = scrapy.Field()
  # Products
  products = scrapy.Field()
  # Funding history
  financing_situation = scrapy.Field()
  # Rankings
  rank = scrapy.Field()
  # Address
  address = scrapy.Field()
  # Website
  website = scrapy.Field()
  # Contact details
  contact = scrapy.Field()
  # QQ number
  qq = scrapy.Field()

With the items file written, we can create the spider file. Here I use CrawlSpider. Before creating the spider from the command line, cd into the ZhiYouJi folder, then run:
```
scrapy genspider -t crawl zhiyouji 'jobui.com'
```
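After creating the spider file, open the project in PyCharm.
That completes the preparation; what remains is writing the spider and extracting the data.
Open zhiyouji.py under the spiders directory and start with a bit of analysis.
First, settle the starting URL by editing start_urls in the file: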
```
start_urls = ['http://www.jobui.com/cmp']
```
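The site's data is paginated, so we need the URL of the next page.

Inspecting the links shows that next-page URLs follow the pattern /cmp?n=<page number>#listInter, so a regular expression can pick them out.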
```
# Follow the next-page links

Rule(LinkExtractor(allow=r'/cmp\?n=\d+\#listInter'), follow=True),
```
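As a quick sanity check of the pagination pattern, the sketch below runs the same regular expression with Python's re module against a few URLs. The sample URLs are made up for illustration and only follow the /cmp?n=<page number>#listInter shape described above; NEXT_PAGE_PATTERN and is_next_page are hypothetical names. To my understanding LinkExtractor applies its allow patterns as a regex search over the absolute URL, so re.search should be a reasonable stand-in.

```python
import re
from urllib.parse import urldefrag

# Same pattern as in the Rule; the sample URLs below are illustrative only.
NEXT_PAGE_PATTERN = re.compile(r'/cmp\?n=\d+\#listInter')


def is_next_page(url):
    """Return True if the URL looks like a paginated company-list link."""
    return NEXT_PAGE_PATTERN.search(url) is not None


samples = [
    'http://www.jobui.com/cmp?n=2#listInter',   # paginated list page
    'http://www.jobui.com/cmp?n=15#listInter',  # paginated list page
    'http://www.jobui.com/cmp',                 # first page, no page number
    'http://www.jobui.com/company/123/',        # detail page (made-up id)
]
matches = [is_next_page(u) for u in samples]
print(matches)

# '#listInter' is only an in-page anchor; stripping it leaves the part of
# the URL that identifies the page on the server.
url_without_fragment, fragment = urldefrag(samples[0])
print(url_without_fragment, fragment)
```

Only the two paginated links match, so the rule would not mistake the detail pages for list pages.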
With the next-page URLs handled, the next step is the detail-page URLs. The detail links follow the pattern /company/<digits>/, so a regular expression can match them as well.
```
# Extract the detail-page URLs

Rule(LinkExtractor(allow=r'/company/\d+/$'), callback='parse_item', follow=False),
```
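The trailing $ in the detail-page pattern matters: it keeps the rule from matching longer URLs that merely start with /company/<digits>/. The sketch below checks this with plain re; the company id and the /jobs/ sub-page are invented for illustration, and DETAIL_PATTERN and is_detail_page are hypothetical names.

```python
import re

# Same pattern as in the Rule; '$' anchors the match at the end of the URL.
DETAIL_PATTERN = re.compile(r'/company/\d+/$')


def is_detail_page(url):
    """Return True if the URL ends with /company/<digits>/."""
    return DETAIL_PATTERN.search(url) is not None


candidates = [
    'http://www.jobui.com/company/10375749/',       # detail page (made-up id)
    'http://www.jobui.com/company/10375749/jobs/',  # hypothetical sub-page
    'http://www.jobui.com/company/10375749',        # no trailing slash
    'http://www.jobui.com/cmp?n=2#listInter',       # list page
]
detail_urls = [u for u in candidates if is_detail_page(u)]
print(detail_urls)
```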

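When extracting the detail-page links we specified the callback parse_item, so the next step is to pull out the data we want inside that function. The code is actually simple: it uses XPath to extract each field. If you are not familiar with XPath, a tutorial online will get you up to speed. Here is the code: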

  def parse_item(self, response):
      # Instantiate the item object
      item = ZhiyoujiItem()

      # Extract the data with XPath

      # Company name
      item['name'] = response.xpath('//*[@id="companyH1"]/a/text()').extract_first()
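      # View count: keep the part before the character '人' and trim whitespace
      item['views'] = response.xpath('//div[@class="grade cfix sbox"]/div[1]/text()').extract_first().split(u'人')[0].strip()

      # Some company detail pages have no image, so the page structure
      # differs slightly; fall back to a second XPath in that case.

      # Company type (the text before the '/')
      try:
          item['type'] = response.xpath('//div[@class="cfix fs16"]/dl/dd[1]/text()').extract_first().split('/')[0]
      except (AttributeError, IndexError):
          # extract_first() returned None, or the text had no '/' part
          item['type'] = response.xpath('//*[@id="cmp-intro"]/div/div/dl/dd[1]/text()').extract_first().split('/')[0]

      # Company size (the text after the '/')
      try:
          item['size'] = response.xpath('//div[@class="cfix fs16"]/dl/dd[1]/text()').extract_first().split('/')[1]
      except (AttributeError, IndexError):
          item['size'] = response.xpath('//*[@id="cmp-intro"]/div/div/dl/dd[1]/text()').extract_first().split('/')[1]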

      # Industry
      item['industry'] = response.xpath('//dd[@class="comInd"]/a[1]/text()').extract_first()
      # Short name
      item['abbreviation'] = response.xpath('//dl[@class="j-edit hasVist dlli mb10"]/dd[3]/text()').extract_first()
      # Company description (join all text nodes into one string)
      item['info'] = ''.join(response.xpath('//*[@id="textShowMore"]/text()').extract())
      # Approval rating
      item['praise'] = response.xpath('//div[@class="swf-contA"]/div/h3/text()').extract_first()
      # Salary range
      item['salary_range'] = response.xpath('//div[@class="swf-contB"]/div/h3/text()').extract_first()
      # Products (a list of names)
      item['products'] = response.xpath('//div[@class="mb5"]/a/text()').extract()

      # Funding history: one dict per funding round
      data_list = []
      node_list = response.xpath('//div[5]/ul/li')
      for node in node_list:
          temp = {}
          # Funding date
          temp['date'] = node.xpath('./span[1]/text()').extract_first()
          # Funding stage
          temp['status'] = node.xpath('./h3/text()').extract_first()
          # Amount raised
          temp['sum'] = node.xpath('./span[2]/text()').extract_first()
          # Investors
          temp['investors'] = node.xpath('./span[3]/text()').extract_first()

          data_list.append(temp)

      item['financing_situation'] = data_list

      # Rankings: one {ranking name: position} dict per entry
      data_list = []
      node_list = response.xpath('//div[@class="fs18 honor-box"]/div')
      for node in node_list:
          temp = {}

          key = node.xpath('./a/text()').extract_first()
          temp[key] = int(node.xpath('./span[2]/text()').extract_first())
          data_list.append(temp)

      item['rank'] = data_list

      # Address
      item['address'] = response.xpath('//dl[@class="dlli fs16"]/dd[1]/text()').extract_first()
      # Website
      item['website'] = response.xpath('//dl[@class="dlli fs16"]/dd[2]/a/text()').extract_first()
      # Contact details
      item['contact'] = response.xpath('//div[@class="j-shower1 dn"]/dd/text()').extract_first()
      # QQ number
      item['qq'] = response.xpath('//dd[@class="cfix"]/span/text()').extract_first()

      # Uncomment to print each field while debugging:
      # for k, v in item.items():
      #     print(k, v)
      # print('****************************************')
      yield item
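
Two lines in parse_item call a method directly on the result of extract_first(), which is None whenever the XPath matches nothing: the view count calls .split() on it and the ranking passes it to int(). A minimal sketch of None-tolerant versions follows. The helper names parse_views and parse_rank and the sample strings are assumptions for illustration; the real page text may look different.

```python
def parse_views(raw_text):
    """Return the text before '人' with whitespace trimmed, or None."""
    if raw_text is None:  # extract_first() found nothing
        return None
    return raw_text.split('人')[0].strip()


def parse_rank(raw_text):
    """Convert a ranking string to int, or return None if that fails."""
    try:
        return int(raw_text)
    except (TypeError, ValueError):  # None, or text that is not a number
        return None


print(parse_views('  4521人瀏覽過 '))  # made-up sample text
print(parse_views(None))
print(parse_rank(' 12 '))
print(parse_rank(None))
```

In the spider these would wrap the extract_first() results, for example parse_rank(node.xpath('./span[2]/text()').extract_first()), so a page with a missing field yields None instead of aborting the whole item.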

A few things to note in the code above:

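First, some detail pages show an image in the company overview and others do not. As a result, an XPath written for one layout may match nothing on a page without an image, so the code handles that case explicitly:

# Some company detail pages have no image, so the page structure
# differs slightly; fall back to a second XPath in that case.

# Company type
try:
    item['type'] = response.xpath('//div[@class="cfix fs16"]/dl/dd[1]/text()').extract_first().split('/')[0]
except (AttributeError, IndexError):
    item['type'] = response.xpath('//*[@id="cmp-intro"]/div/div/dl/dd[1]/text()').extract_first().split('/')[0]

# Company size
try:
    item['size'] = response.xpath('//div[@class="cfix fs16"]/dl/dd[1]/text()').extract_first().split('/')[1]
except (AttributeError, IndexError):
    item['size'] = response.xpath('//*[@id="cmp-intro"]/div/div/dl/dd[1]/text()').extract_first().split('/')[1]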
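
Both fields come from the same "type / size" text, so the fallback can also be done once and the string split afterwards. The sketch below is one possible way to arrange that, not the original code: first_non_empty and split_type_and_size are hypothetical helpers, and the sample string is invented.

```python
def first_non_empty(*candidates):
    """Return the first candidate that is neither None nor empty."""
    for value in candidates:
        if value:
            return value
    return None


def split_type_and_size(raw_text):
    """Split 'type / size' text into a (type, size) tuple, tolerating gaps."""
    if not raw_text:
        return None, None
    parts = [part.strip() for part in raw_text.split('/')]
    company_type = parts[0] or None
    company_size = parts[1] if len(parts) > 1 and parts[1] else None
    return company_type, company_size


# In the spider the two candidates would be the extract_first() results of
# the two XPath expressions; here the first one simulates a missed match.
raw = first_non_empty(None, '民營公司 / 50-99人')
company_type, company_size = split_type_and_size(raw)
print(company_type, company_size)
```

With this arrangement a page that matches neither layout gives (None, None) rather than raising inside the except branch.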

Everything else is ordinary XPath extraction of the fields we want.

With the data extracted, save it as JSON. Open the pipeline file pipelines.py:

import json

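class ZhiyoujiPipeline(object):

  def open_spider(self, spider):
      # Open the output file once, when the spider starts
      self.file = open('zhiyouji.json', 'w', encoding='utf-8')

  def process_item(self, item, spider):
      # Serialize the item; ensure_ascii=False keeps Chinese text readable
      data = json.dumps(dict(item), ensure_ascii=False, indent=2)
      # Separate consecutive items with a newline. Note that the file is then
      # a sequence of JSON objects, not a single JSON document.
      self.file.write(data + '\n')
      return item

  def close_spider(self, spider):
      # Close the file when the spider finishes
      self.file.close()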
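
One caveat with writing each json.dumps result into the same file: the file ends up holding several JSON objects in a row, which json.load cannot read back as a single document. A common alternative is one compact object per line (often called JSON Lines). The sketch below illustrates the difference with an in-memory buffer; write_items, read_items and the sample items are made up for the example.

```python
import io
import json


def write_items(file_obj, items):
    """Write one compact JSON object per line."""
    for item in items:
        file_obj.write(json.dumps(item, ensure_ascii=False) + '\n')


def read_items(file_obj):
    """Read the objects back, one per non-empty line."""
    return [json.loads(line) for line in file_obj if line.strip()]


# Made-up items standing in for scraped companies.
sample_items = [
    {'name': '示例公司甲', 'views': '4521'},
    {'name': '示例公司乙', 'views': '87'},
]

buffer = io.StringIO()
write_items(buffer, sample_items)
buffer.seek(0)
restored = read_items(buffer)

# Objects written back to back do not form one valid JSON document.
concatenated = ''.join(json.dumps(item) for item in sample_items)
try:
    json.loads(concatenated)
    concatenated_is_valid = True
except json.JSONDecodeError:
    concatenated_is_valid = False

print(restored == sample_items, concatenated_is_valid)
```

In the pipeline this would mean dropping indent=2 and appending '\n' to each write, at the cost of a less human-readable file.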

To save data through the pipeline, enable it in the settings file:
```
ITEM_PIPELINES = {
 'ZhiYouJi.pipelines.ZhiyoujiPipeline': 300,
}
```
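The number 300 is an order value: Scrapy runs the enabled pipelines from the lowest value to the highest, with values conventionally chosen in the 0-1000 range. The sketch below only illustrates that ordering with plain Python; DropIncompletePipeline is a hypothetical second pipeline that does not exist in this project.

```python
ITEM_PIPELINES = {
    'ZhiYouJi.pipelines.ZhiyoujiPipeline': 300,
    # Hypothetical extra pipeline, shown only to illustrate ordering.
    'ZhiYouJi.pipelines.DropIncompletePipeline': 100,
}

# Lower values run first, so the hypothetical filter would see each item
# before the JSON writer does.
run_order = [
    path for path, order in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)
```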
Then run the spider with:
```
scrapy crawl zhiyouji
```
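Because there is so much data, I force-stopped the crawl after it had run for a short while. The JSON file below holds part of the saved data.
The next step is to convert the project into a distributed crawler with scrapy-redis; the reworked project will appear in the next post.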