Scrapy: crawling Tencent's recruitment site

Requirement: extract each position's name, type, link, number of openings, work location, and publish date.

I. Steps to create the Scrapy project

1) Create the project for crawling Tencent job postings: scrapy startproject tencent

2) Enter the project directory (cd tencent) and generate the spider: scrapy genspider tencentPosition hr.tencent.com

3) Open the project in PyCharm

4) Define the item fields in items.py according to the requirements

5) Write the spider

6) Write the item pipeline

7) Add the required configuration to settings.py

8) The project layout as seen in PyCharm (screenshot omitted; a sketch of the layout follows):
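For reference, the layout that scrapy startproject plus scrapy genspider produces looks roughly like this (exact files may vary slightly between Scrapy versions):

tencent/
├── scrapy.cfg               # deploy configuration
└── tencent/                 # the project's Python package
    ├── __init__.py
    ├── items.py             # item field definitions (step 4)
    ├── middlewares.py
    ├── pipelines.py         # item pipeline (step 6)
    ├── settings.py          # project configuration (step 7)
    └── spiders/
        ├── __init__.py
        └── tencentPosition.py   # the spider (step 5)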

II. Code for each file:

1. The tencentPosition.py file

import scrapy

from tencent.items import TencentItem


class TencentpositionSpider(scrapy.Spider):
    name = 'tencentPosition'
    allowed_domains = ['hr.tencent.com']
    offset = 0
    url = "https://hr.tencent.com/position.php?&start="
    start_urls = [url + str(offset) + '#a']

    def parse(self, response):
        # Each posting is a <tr> with class "even" or "odd"
        position_lists = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for position in position_lists:
            item = TencentItem()
            # .get() returns None instead of raising IndexError when a cell is missing
            item["position_name"] = position.xpath("./td[1]/a/text()").get()
            item["position_link"] = position.xpath("./td[1]/a/@href").get()
            item["position_type"] = position.xpath("./td[2]/text()").get()
            item["people_num"] = position.xpath("./td[3]/text()").get()
            item["work_address"] = position.xpath("./td[4]/text()").get()
            item["publish_time"] = position.xpath("./td[5]/text()").get()
            yield item

        # Pagination. This sits outside the for loop so the next page is
        # requested once per page, not once per row. The footer span holds
        # the total number of postings; each page shows 10 of them.
        total = response.xpath('//div[@class="left"]/span/text()').get()

        if total and self.offset < int(total):
            self.offset += 10
            new_url = self.url + str(self.offset) + "#a"
            yield scrapy.Request(new_url, callback=self.parse)
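If the page structure has shifted since this was written, the XPaths are easy to verify interactively before a full crawl. A quick Scrapy shell session, using the same URL and selectors assumed above:

scrapy shell "https://hr.tencent.com/position.php?&start=0#a"

>>> rows = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
>>> len(rows)                                   # postings on this page
>>> rows[0].xpath("./td[1]/a/text()").get()     # first position name
>>> response.xpath('//div[@class="left"]/span/text()').get()  # total count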

2. The items.py file

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    position_name = scrapy.Field()
    position_link = scrapy.Field()
    position_type = scrapy.Field()
    people_num = scrapy.Field()
    work_address = scrapy.Field()
    publish_time = scrapy.Field()

***Note: these field names must match the keys the spider assigns in tencentPosition.py exactly (see the sketch below for what happens when they don't).
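Scrapy enforces this match at runtime: assigning to a field that items.py does not declare raises a KeyError immediately. A minimal sketch (the values and the undeclared "salary" field are made up for illustration):

from tencent.items import TencentItem

item = TencentItem()
item["position_name"] = "example"  # fine: declared in TencentItem
item["salary"] = "20k"             # KeyError: TencentItem does not support field: salary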

3. The pipelines.py file

import json


class TencentPipeline(object):
    def __init__(self):
        print("=======start========")
        self.file = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        print("=====ing=======")
        dict_item = dict(item)  # convert the Item to a plain dict
        json_text = json.dumps(dict_item, ensure_ascii=False) + "\n"
        self.file.write(json_text)
        return item

    def close_spider(self, spider):
        print("=======end===========")
        self.file.close()
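As a side note, for output this simple Scrapy's built-in feed exports could replace the hand-written pipeline entirely: running scrapy crawl tencentPosition -o tencent.jl writes one JSON object per line with no pipeline code at all (-o tencent.json would produce a single JSON array instead).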

4. The settings.py file
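For TencentPipeline to receive items at all, it must be registered in ITEM_PIPELINES. A minimal sketch of the relevant settings (the browser User-Agent string is an assumption; any realistic one should do):

# Register the pipeline; 300 is its priority in the 0-1000 range
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

# Present a browser-like User-Agent rather than Scrapy's default
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# New projects enable robots.txt compliance by default; this crawl ignores it
ROBOTSTXT_OBEY = False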

5. Running the project:

1) Create a main.py file in the project root

2) The main.py file

from scrapy import cmdline

cmdline.execute("scrapy crawl tencentPosition".split())
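Running main.py (for example via PyCharm's Run button) is equivalent to typing scrapy crawl tencentPosition in a terminal at the project root: cmdline.execute simply hands that argument list to Scrapy's command-line entry point.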

III. Running result:
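Judging from the pipeline above, the console prints the "=======start========" marker, one "=====ing=======" line per scraped posting, and "=======end===========" on close, while each posting lands in tencent.json as one JSON object per line.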
