Scrapy: crawling Tencent's recruitment site

Requirement: extract each position's name, type, link, number of openings, work location, and publish date.

I. Steps to create the Scrapy project

1) Create the project for crawling Tencent job postings: scrapy startproject tencent

2) Enter the project directory (cd tencent) and generate the spider: scrapy genspider tencentPosition hr.tencent.com

3) Open the project in PyCharm

4) Define the item fields in items.py according to the requirements

5) Write the spider

6) Write the item pipeline

7) Add the required configuration to settings.py

8) The project layout as seen in PyCharm (screenshot omitted; a sketch of the layout follows):
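For reference, the layout that scrapy startproject plus scrapy genspider produces looks roughly like this (exact files may vary slightly between Scrapy versions):

tencent/
├── scrapy.cfg               # deploy configuration
└── tencent/                 # the project's Python package
    ├── __init__.py
    ├── items.py             # item field definitions (step 4)
    ├── middlewares.py
    ├── pipelines.py         # item pipeline (step 6)
    ├── settings.py          # project configuration (step 7)
    └── spiders/
        ├── __init__.py
        └── tencentPosition.py   # the spider (step 5)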

II. Code for each file:

1. The tencentPosition.py file

import scrapy

from tencent.items import TencentItem


class TencentpositionSpider(scrapy.Spider):
    name = 'tencentPosition'
    allowed_domains = ['hr.tencent.com']
    offset = 0
    url = "https://hr.tencent.com/position.php?&start="
    start_urls = [url + str(offset) + '#a']

    def parse(self, response):
        # Each posting is a <tr> with class "even" or "odd"
        position_lists = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for position in position_lists:
            item = TencentItem()
            # .get() returns None instead of raising IndexError when a cell is missing
            item["position_name"] = position.xpath("./td[1]/a/text()").get()
            item["position_link"] = position.xpath("./td[1]/a/@href").get()
            item["position_type"] = position.xpath("./td[2]/text()").get()
            item["people_num"] = position.xpath("./td[3]/text()").get()
            item["work_address"] = position.xpath("./td[4]/text()").get()
            item["publish_time"] = position.xpath("./td[5]/text()").get()
            yield item

        # Pagination. This sits outside the for loop so the next page is
        # requested once per page, not once per row. The footer span holds
        # the total number of postings; each page shows 10 of them.
        total = response.xpath('//div[@class="left"]/span/text()').get()

        if total and self.offset < int(total):
            self.offset += 10
            new_url = self.url + str(self.offset) + "#a"
            yield scrapy.Request(new_url, callback=self.parse)
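If the page structure has shifted since this was written, the XPaths are easy to verify interactively before a full crawl. A quick Scrapy shell session, using the same URL and selectors assumed above:

scrapy shell "https://hr.tencent.com/position.php?&start=0#a"

>>> rows = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
>>> len(rows)                                   # postings on this page
>>> rows[0].xpath("./td[1]/a/text()").get()     # first position name
>>> response.xpath('//div[@class="left"]/span/text()').get()  # total count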

2. The items.py file

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    position_name = scrapy.Field()
    position_link = scrapy.Field()
    position_type = scrapy.Field()
    people_num = scrapy.Field()
    work_address = scrapy.Field()
    publish_time = scrapy.Field()

***Note: these field names must match the keys the spider assigns in tencentPosition.py exactly (see the sketch below for what happens when they don't).
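Scrapy enforces this match at runtime: assigning to a field that items.py does not declare raises a KeyError immediately. A minimal sketch (the values and the undeclared "salary" field are made up for illustration):

from tencent.items import TencentItem

item = TencentItem()
item["position_name"] = "example"  # fine: declared in TencentItem
item["salary"] = "20k"             # KeyError: TencentItem does not support field: salary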

3. The pipelines.py file

import json


class TencentPipeline(object):
    def __init__(self):
        print("=======start========")
        self.file = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        print("=====ing=======")
        dict_item = dict(item)  # convert the Item to a plain dict
        json_text = json.dumps(dict_item, ensure_ascii=False) + "\n"
        self.file.write(json_text)
        return item

    def close_spider(self, spider):
        print("=======end===========")
        self.file.close()
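As a side note, for output this simple Scrapy's built-in feed exports could replace the hand-written pipeline entirely: running scrapy crawl tencentPosition -o tencent.jl writes one JSON object per line with no pipeline code at all (-o tencent.json would produce a single JSON array instead).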

4. The settings.py file
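For TencentPipeline to receive items at all, it must be registered in ITEM_PIPELINES. A minimal sketch of the relevant settings (the browser User-Agent string is an assumption; any realistic one should do):

# Register the pipeline; 300 is its priority in the 0-1000 range
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

# Present a browser-like User-Agent rather than Scrapy's default
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# New projects enable robots.txt compliance by default; this crawl ignores it
ROBOTSTXT_OBEY = False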

5. Running the project:

1) Create a main.py file in the project root

2) The main.py file

from scrapy import cmdline

cmdline.execute("scrapy crawl tencentPosition".split())
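Running main.py (for example via PyCharm's Run button) is equivalent to typing scrapy crawl tencentPosition in a terminal at the project root: cmdline.execute simply hands that argument list to Scrapy's command-line entry point.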

III. Running result:
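Judging from the pipeline above, the console prints the "=======start========" marker, one "=====ing=======" line per scraped posting, and "=======end===========" on close, while each posting lands in tencent.json as one JSON object per line.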
