Scraping job postings from Tencent's experienced-hire (social recruitment) site with Scrapy

Requirements

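The crawler collects job postings from the experienced-hire section of Tencent's recruitment site and saves them to a database with these fields: position, country, city, business group, job category, responsibilities, publish date, and detailed description.
Target site: Tencent Careers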

Page analysis

Open the target page in a browser and press F12 to start capturing network traffic.
The capture shows that the page talks to the backend via Ajax. Rendering the current page uses two backend endpoints, GetMultiDictionary and Query.

  • GetMultiDictionary

     Returns the business groups shown on the left side of the page.
    
  • Query

     Returns the job postings shown on the right.
    

The response from the Query endpoint is the one I need.

URL: https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1581438061100&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
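The query string in the captured URL can be assembled from its parameters rather than concatenated by hand. The following is a minimal sketch using only the standard library; the helper name `build_query_url` and its arguments are hypothetical, and the parameter names and order are copied from the captured URL above.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://careers.tencent.com/tencentcareer/api/post/Query"


def build_query_url(page_index, page_size=10, timestamp_ms=None):
    """Return the Query URL for one page of results (hypothetical helper)."""
    if timestamp_ms is None:
        # Millisecond timestamp, like the 13-digit value in the captured URL.
        timestamp_ms = int(round(time.time() * 1000))
    # A list of pairs keeps the parameters in the same order as the capture.
    params = [
        ("timestamp", timestamp_ms),
        ("countryId", ""),
        ("cityId", ""),
        ("bgIds", ""),
        ("productId", ""),
        ("categoryId", ""),
        ("parentCategoryId", ""),
        ("attrId", ""),
        ("keyword", ""),
        ("pageIndex", page_index),
        ("pageSize", page_size),
        ("language", "zh-cn"),
        ("area", "cn"),
    ]
    return BASE_URL + "?" + urlencode(params)


# Passing the timestamp from the capture reproduces the captured URL.
print(build_query_url(1, timestamp_ms=1581438061100))
```

Building the URL from pairs makes it harder to drop an `&` separator or a parameter by accident.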

result:{"Code":200,"Data":{"Count":4221,"Posts":[{"Id":0,"PostId":"1123176672162484224","RecruitPostId":47753,"RecruitPostName":"18302-新動作手遊後臺開發工程師(深圳)","CountryName":"中國","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"技術","Responsibility":"負責新動作手機遊戲的服務器端系統開發工作;\n負責部分遊戲服務器端的架構工作;\n負責服務器端部分的性能優化工作;\n根據需要可能會負責部分前端的功能開發工作;","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1123176672162484224","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1215570947008892928","RecruitPostId":56702,"RecruitPostName":"34975-高級海外遊戲數據分析","CountryName":"中國","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"產品","Responsibility":"負責天美旗下海外重點產品數據分析,提供調優建議,併爲後續國際化產品設計沉澱認知;\n日常遊戲運營數據監控及問題分析;\n針對海外市場洞察與用戶反饋提供假設並分析驗證;\n通過數據分析提供版本與運營活動優化建議;\n長期沉澱基於數據分析的國際化產品研發與運營經驗。\n","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1215570947008892928","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1212989681885515776","RecruitPostId":56503,"RecruitPostName":"18302-動作品類玩法策劃專家(深圳)","CountryName":"中國","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"產品","Responsibility":"關注市場的熱點和前沿動作產品,能夠基於核心玩法輸出高水準的產品分析報告並發現新的動作產品機會;\n參與攻堅產品立項、定位,以及前期玩法搭建;\n參與遊戲核心玩法和整體架構的設計,並對其進行論證和優化;\n協同程序、美術等其他部門合作,推動遊戲核心玩法的實現以及論證,達到最終的設計效果;\n參與研究用戶和市場的偏好,探索動作品類前進的方向;\n關注用戶反饋,準確地發現產品玩法問題並予以解決。\n","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1212989681885515776","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1171396248633085952","RecruitPostId":53270,"RecruitPostName":"18302-國際IP-日語海外PM(深圳)","CountryName":"中國","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"產品","Responsibility":"負責項目資源的協調和組織,確保項目團隊各干係人協同工作;\n負責項目計劃的制定,跟蹤和維護,確定項目按計劃進行;\n負責組織項目各項評審會議及項目例會;\n協調項目資源配全,確保項目任務有序推進;\n及時發現並跟蹤解決項目問題,有效管理項目風險。","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=117139
6248633085952","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1207609468850802688","RecruitPostId":56061,"RecruitPostName":"30933-FFW-Lead Game Narrative Designer (Los Angeles)","CountryName":"中國","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"產品","Responsibility":"Build up a narrative team for an AAA game title and define high efficiency work flow for narrative design;\nContribute to the narrative development of game stories, lore, quests, etc;\nCollaborate with your team and game designers to create, iterate on outstanding experience of storytelling and narrative contents.\n","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1207609468850802688","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1207612736846958592","RecruitPostId":56063,"RecruitPostName":"30933-FFW-Storyboard Artist(Los Angeles)","CountryName":"中國","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"設計","Responsibility":"Create storyboard sequences from rough storyboard panels through finished storyboard sequences that serve narrative/ storytelling objectives;\nEnsure that the vision and style stay consistent throughout the show, staging, character acting in storyboard work. 
If needed, making drawing or text changes in description, dialog or numbering to offer clear description;\nGood understanding for perspective and knows how to utilize it to create space in the storyboard;\nGood understanding for timing and knows how to import images, audio files and make animatic using Storyboard Pro or other equivalent software.\nKnow how camera works and able to construct a good flow of it into the animatic.\nFollow production’s guidelines and properly document every iteration.\nThe ability to work well within a team environment.\nCollaborate with narrative director and production manager to setup goals and schedules for the storyboards.\nRegularly meet with Director, Producer, and other Storyboard Artists to review, execute and revise storyboards.\nOversee the implementation and provide with necessary feedback and solutions.\nAttend and contribute to relevant meetings and pitches as needed, specially script meetings and narrative meetings.\n","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1207612736846958592","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1207612736465276928","RecruitPostId":56062,"RecruitPostName":"30933-FFW-Senior Character Concept Artist(Los Angeles)","CountryName":"中國","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"設計","Responsibility":"Develop character concepts and costume designs;\nIterate on game asset designs with our internal and external team using ideation sketches. 
paint overs, and deliver final design with great rendering for visual target.","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1207612736465276928","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1227208414266920960","RecruitPostId":57173,"RecruitPostName":"29912-內容運營","CountryName":"中國","LocationName":"北京","BGName":"PCG","ProductName":"微視","CategoryName":"內容","Responsibility":"1、監控全網新聞熱點,組織內容生產,及時挖掘重點新聞,對新聞線索及時作出判斷;\n2、結合熱點進行選題策劃,進行大事件運營;\n3、聯動甲方媒體,進行對接,合作策劃;\n\n","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1227208414266920960","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1204296465451585536","RecruitPostId":55727,"RecruitPostName":"30933-AOV商業化運營(深圳)","CountryName":"中國","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"產品","Responsibility":"制定商業化運營規劃,把控整體商業化節奏,確保收入目標達成;\n負責遊戲內商業化系統規劃及具體落地,跟進系統數據,整合玩家建議,並提出優化建議;\n負責收入方面運營數據和用戶反饋的收集與分析,不斷優化遊戲商業化體系。\n","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1204296465451585536","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1227248031292723200","RecruitPostId":57176,"RecruitPostName":"30359-燈塔-Java後臺開發高級工程師","CountryName":"中國","LocationName":"深圳","BGName":"PCG","ProductName":"","CategoryName":"技術","Responsibility":"1、負責PCG大數據實時傳輸、存儲、實時計算、即時查詢計算等底層基礎支撐框架的開發和運營;\n 2、負責PCG大數據產品底層OLAP分析引擎的開發和運營。","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=0","SourceID":1,"IsCollect":false,"IsValid":true}]}}    
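Only part of each entry in `Posts` is needed for storage. As a rough sketch, the response can be parsed with the `json` module and trimmed to the wanted fields; `extract_posts` and `FIELDS` are hypothetical names, and `sample` is a shortened stand-in with the same shape as the response above, not the full payload.

```python
import json

# The keys of each post that are worth keeping.
FIELDS = ("RecruitPostName", "CountryName", "LocationName", "BGName",
          "CategoryName", "Responsibility", "LastUpdateTime", "PostURL")


def extract_posts(body):
    """Parse a Query response body; return (total count, trimmed posts)."""
    data = json.loads(body)["Data"]
    posts = [{name: post.get(name, "") for name in FIELDS}
             for post in data["Posts"]]
    return data["Count"], posts


# Shortened sample in the same shape as the captured response.
sample = json.dumps({
    "Code": 200,
    "Data": {
        "Count": 4221,
        "Posts": [{
            "Id": 0,
            "PostId": "1123176672162484224",
            "RecruitPostName": "18302-新動作手遊後臺開發工程師(深圳)",
            "CountryName": "中國",
            "LocationName": "深圳",
            "BGName": "IEG",
            "CategoryName": "技術",
            "Responsibility": "負責新動作手機遊戲的服務器端系統開發工作;",
            "LastUpdateTime": "2020年02月11日",
            "PostURL": "http://careers.tencent.com/jobdesc.html?postId=1123176672162484224",
        }],
    },
})

count, posts = extract_posts(sample)
print(count, posts[0]["RecruitPostName"])
```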

Item implementation

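import scrapy


class TencentItem(scrapy.Item):
    # Position name
    RecruitPostName = scrapy.Field()
    # Country
    CountryName = scrapy.Field()
    # Location (city)
    LocationName = scrapy.Field()
    # Business group
    BGName = scrapy.Field()
    # Job category
    CategoryName = scrapy.Field()
    # Responsibilities
    Responsibility = scrapy.Field()
    # Publish date
    LastUpdateTime = scrapy.Field()
    # Detail page URL (set by the spider and stored by the pipeline)
    PostURL = scrapy.Field()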

Pipeline implementation

import MySQLdb


class TencentPipeline(object):
    # Purpose: save item data to MySQL.
    def __init__(self):
        print("Pipeline Initialization complete")

    def process_item(self, item, spider):
        db = MySQLdb.connect("localhost", "root", "sa", "spider")
        cursor = db.cursor()
        db.set_character_set('utf8')
        cursor.execute('SET NAMES utf8;')
        cursor.execute('SET CHARACTER SET utf8;')
        cursor.execute('SET character_set_connection=utf8;')
        # Use %s placeholders and pass the values separately so the driver
        # escapes them; a quote in the job text would otherwise break the SQL.
        sql = """INSERT INTO `tencentpostion` (
                    `recruitPostName`,
                    `countryName`,
                    `locationName`,
                    `bgName`,
                    `categoryName`,
                    `responsibility`,
                    `lastUpdateTime`,
                    `PostURL`
                )
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"""
        params = (item['RecruitPostName'],
                  item['CountryName'],
                  item['LocationName'],
                  item['BGName'],
                  item['CategoryName'],
                  item['Responsibility'],
                  item['LastUpdateTime'],
                  item['PostURL'])
        try:
            cursor.execute(sql, params)
            db.commit()
        except MySQLdb.Error as e:
            # Undo the failed insert and report the reason.
            db.rollback()
            print("some error occurred: %s" % e)
        finally:
            db.close()
        return item

    def close_spider(self, spider):
        print("close spider")

Spider implementation

# -*- coding: utf-8 -*-
# Crawl Tencent's experienced-hire recruitment site
import scrapy
import time
import json
from WHNews.items import TencentItem


class TencentpostionSpider(scrapy.Spider):
    name = 'tencentPostion'
    allowed_domains = ['tencent.com']
    url = "https://careers.tencent.com/tencentcareer/api/post/Query?"
    offset = 1
    # Start URL: the request needs a millisecond timestamp parameter
    nowTime = time.time()
    timestamp = int(round(nowTime * 1000))
    # Each parameter must be separated by "&", as in the captured URL
    url_suffix = "timestamp=" + str(timestamp) + "&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageSize=10&language=zh-cn&area=cn&pageIndex="
    url = url + url_suffix
    start_urls = [url + str(offset)]

    def parse(self, response):
        jsonBody = json.loads(response.body)
        posts = jsonBody['Data']['Posts']
        for post in posts:
            modelItem = TencentItem()
            modelItem['RecruitPostName'] = post['RecruitPostName']
            modelItem['CountryName'] = post['CountryName']
            modelItem['LocationName'] = post['LocationName']
            modelItem['BGName'] = post['BGName']
            modelItem['CategoryName'] = post['CategoryName']
            modelItem['Responsibility'] = post['Responsibility']
            modelItem['LastUpdateTime'] = post['LastUpdateTime']
            modelItem['PostURL'] = post['PostURL']
            yield modelItem
        # After each page is processed, request the next one until the
        # hard-coded last page (417) is reached.
        # self.offset increases by 1 (one page), is appended to the URL,
        # and self.parse is called again to handle the response.
        if self.offset < 417:
            self.offset = self.offset + 1
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
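The spider stops at a hard-coded page number. As a hedged alternative, the last page could be derived from the `Count` field of the response, assuming `Count` is the total number of matching posts and `pageIndex` starts at 1 as in the captured URL. The helpers `total_pages` and `next_page` are hypothetical names for this sketch.

```python
import math


def total_pages(count, page_size=10):
    """Number of pages needed to list `count` posts."""
    return math.ceil(count / page_size)


def next_page(current, count, page_size=10):
    """Return the next pageIndex, or None when `current` is the last page."""
    candidate = current + 1
    if candidate <= total_pages(count, page_size):
        return candidate
    return None


# With the Count from the sample response (4221) and pageSize=10:
print(total_pages(4221))
print(next_page(422, 4221), next_page(423, 4221))
```

Inside `parse`, the value returned by `next_page` would decide whether another `scrapy.Request` is yielded, so the crawl follows the site's current posting count instead of a fixed limit.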

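Notes:

  1. The backend returns a JSON string, so XPath is of no use here. Import the json package and convert the response into a Python object before processing it.
jsonBody = json.loads(response.body)
  2. The request parameters must include a timestamp.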

Results

(Screenshot of the results; image not available.)
