Fetching Job Listing Data from Lagou

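When job hunting, I often use Lagou. Searching for a keyword such as "自動化測試工程師" (automation test engineer) returns a long list of job postings. How can we collect those postings in bulk and analyze them as a whole? With the data in hand we could look at, for example, the highest salary, the lowest salary, and the skills that automation testing roles require, all of which makes a useful reference.

Tonight we implement the first part: run a keyword search on Lagou and fetch the automation test engineer listings, both the entries on each page and the total number of pages. When you page through the search results, the URL in the address bar does not change; the page sends an Ajax request instead, and that request is what we need to reproduce to get data from a dynamic site like this. Inspecting the request sent when turning a page gives the following information: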

Request URL:

https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false

Request headers:

Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: _ga=GA1.2.1237290736.1534169036; user_trace_token=20180813220356-b7e42516-9f01-11e8-bb78-525400f775ce; LGUID=20180813220356-b7e428ad-9f01-11e8-bb78-525400f775ce; index_location_city=%E5%85%A8%E5%9B%BD; _gid=GA1.2.675811712.1540794503; JSESSIONID=ABAAABAAADEAAFI097BA2BE39D3B0D0BEA1C82AE832AF02; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539785285,1540794503,1540819902,1540905505; _gat=1; LGSID=20181030211826-48e29064-dc46-11e8-8467-5254005c3644; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; TG-TRACK-CODE=index_search; SEARCH_ID=389112e1ab2640b098233a552d502745; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1540905515; LGRID=20181030211836-4eaa43a5-dc46-11e8-b7be-525400f775ce
DNT: 1
Host: www.lagou.com
Origin: https://www.lagou.com
Referer: https://www.lagou.com/jobs/list_%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=&fromSearch=true&suginput=
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36

Request parameters:

Request method: POST

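From the information above, the request method is POST. Among the request parameters, pn is the page number and kd is the search keyword. We start by fetching the listings on a single page; the source is: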

import requests


def getHeaders():
   headers={
      'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
      'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
      'Cookie':'_ga=GA1.2.1237290736.1534169036; user_trace_token=20180813220356-b7e42516-9f01-11e8-bb78-525400f775ce; LGUID=20180813220356-b7e428ad-9f01-11e8-bb78-525400f775ce; index_location_city=%E5%85%A8%E5%9B%BD; _gid=GA1.2.675811712.1540794503; JSESSIONID=ABAAABAAAGFABEF93F47251563A52306423D37E945D2C54; _gat=1; LGSID=20181029213144-fa3c8e13-db7e-11e8-b51c-525400f775ce; PRE_UTM=; PRE_HOST=www.bing.com; PRE_SITE=https%3A%2F%2Fwww.bing.com%2F; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539529521,1539785285,1540794503,1540819902; SEARCH_ID=ae3ae41a58d94802a68e848d36c30711; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1540819909; LGRID=20181029213151-fe7324dc-db7e-11e8-b51c-525400f775ce',
      'Referer':'https://www.lagou.com/jobs/list_%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=sug&fromSearch=true&suginput=%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95'}
   return headers


def laGou(url='https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false',page=2):
   # POST the search form: pn is the page number, kd is the search keyword
   positions = []
   r = requests.post(
      url=url,
      headers=getHeaders(),
      data={'first': False, 'pn': page, 'kd': '自動化測試工程師'})
   # Decode the JSON once and loop over the results actually returned,
   # instead of assuming every page holds exactly 15 entries
   results = r.json()['content']['positionResult']['result']
   for result in results:
      position = {
         'company': result['companyFullName'],
         'city': result['city'],
         'education': result['education'],
         'experience': result['workYear'],
         'salary': result['salary'],
         'labels': result['positionLables'],
         'benefits': result['positionAdvantage']
      }
      positions.append(position)
   for item in positions:
      print(item)
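
The field extraction can also be separated from the network call so that it can be checked offline. The following is a minimal sketch, assuming the decoded response has the content / positionResult / result shape used above; the helper name extract_positions and the sample data are hypothetical and not part of the original source.

```python
# Field names taken from the JSON keys read in the source above
FIELDS = ('companyFullName', 'city', 'education', 'workYear',
          'salary', 'positionLables', 'positionAdvantage')


def extract_positions(payload):
    """Pull the listing fields out of one decoded JSON response (hypothetical helper)."""
    results = payload.get('content', {}).get('positionResult', {}).get('result', [])
    # Missing fields become None instead of raising KeyError
    return [{field: item.get(field) for field in FIELDS} for item in results]


# Made-up sample shaped like the response used above
sample = {'content': {'positionResult': {'result': [
    {'companyFullName': 'Example Co', 'city': 'Beijing', 'salary': '10k-20k'},
]}}}

for row in extract_positions(sample):
    print(row)
```

Because the helper takes a plain dict, it works the same on a page with fewer entries than usual or on a response that lacks the expected keys, in which case it returns an empty list.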

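Note: in the source above, the page parameter is the page number and can be set to any value. Calling laGou() prints the listings it retrieved, such as company and salary; see the screenshot of the data printed after calling laGou():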

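The code above fetches the listings for a single page. Next we fetch the listings for every page of the search results. Searching for "自動化測試工程師" returns 30 pages, as shown in the figure below: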

So we call laGou() repeatedly, passing a different value for its page parameter each time. The source is:

for item in range(1, 31):
   laGou(page=item)
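
The loop above hard-codes 30 pages. If the total number of results is known, the page count can be derived instead. The sketch below assumes 15 results per page (the number the source reads from each page); the helpers page_count and crawl, the delay parameter, and the fake_fetch stand-in are hypothetical names introduced for illustration.

```python
import math
import time


def page_count(total_results, page_size=15):
    """Return how many pages are needed to hold total_results entries."""
    if total_results <= 0:
        return 0
    return math.ceil(total_results / page_size)


def crawl(fetch_page, total_results, page_size=15, delay=0.0):
    """Call fetch_page(page) for pages 1..N and collect everything it returns."""
    rows = []
    for page in range(1, page_count(total_results, page_size) + 1):
        rows.extend(fetch_page(page))
        time.sleep(delay)  # pause between requests to avoid hammering the site
    return rows


# Stand-in for a real fetch function that returns a list of rows per page
def fake_fetch(page):
    return ['page %d' % page]


print(page_count(450))                      # 450 results at 15 per page
print(crawl(fake_fetch, total_results=31))  # 31 results need 3 pages
```

Passing the fetch function as an argument keeps the paging logic independent of the HTTP call, so it can be tried out without touching the site.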

The code above is quite simple. The complete source follows:

#!/usr/bin/env python
#coding:utf-8

#Author:WuYa

import csv
import requests

def getHeaders():
   headers={
      'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
      'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
      'Cookie':'_ga=GA1.2.1237290736.1534169036; user_trace_token=20180813220356-b7e42516-9f01-11e8-bb78-525400f775ce; LGUID=20180813220356-b7e428ad-9f01-11e8-bb78-525400f775ce; index_location_city=%E5%85%A8%E5%9B%BD; _gid=GA1.2.675811712.1540794503; JSESSIONID=ABAAABAAAGFABEF93F47251563A52306423D37E945D2C54; _gat=1; LGSID=20181029213144-fa3c8e13-db7e-11e8-b51c-525400f775ce; PRE_UTM=; PRE_HOST=www.bing.com; PRE_SITE=https%3A%2F%2Fwww.bing.com%2F; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539529521,1539785285,1540794503,1540819902; SEARCH_ID=ae3ae41a58d94802a68e848d36c30711; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1540819909; LGRID=20181029213151-fe7324dc-db7e-11e8-b51c-525400f775ce',
      'Referer':'https://www.lagou.com/jobs/list_%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=sug&fromSearch=true&suginput=%E8%87%AA%E5%8A%A8%E5%8C%96%E6%B5%8B%E8%AF%95'}
   return headers


def laGou(url='https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false',page=2):
   # POST the search form: pn is the page number, kd is the search keyword
   positions = []
   r = requests.post(
      url=url,
      headers=getHeaders(),
      data={'first': False, 'pn': page, 'kd': '自動化測試工程師'})
   # Decode the JSON once and loop over the results actually returned,
   # instead of assuming every page holds exactly 15 entries
   results = r.json()['content']['positionResult']['result']
   for result in results:
      position = {
         'company': result['companyFullName'],
         'city': result['city'],
         'education': result['education'],
         'experience': result['workYear'],
         'salary': result['salary'],
         'labels': result['positionLables'],
         'benefits': result['positionAdvantage']
      }
      positions.append(position)
   for item in positions:
      print(item)

if __name__ == '__main__':
   for item in range(1, 31):
      laGou(page=item)
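
The listing above imports csv but never uses it. One possible next step is to write the collected rows to CSV rather than printing them. This is only a sketch: the helper to_csv_text, the chosen columns, and the sample row are assumptions, and the column names simply reuse the JSON field names read in the source.

```python
import csv
import io

# Columns reuse the JSON field names from the source; the selection is arbitrary
COLUMNS = ['companyFullName', 'city', 'salary', 'positionLables']


def to_csv_text(rows):
    """Render a list of row dicts as CSV text (hypothetical helper)."""
    buf = io.StringIO(newline='')
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, extrasaction='ignore')
    writer.writeheader()
    for row in rows:
        flat = dict(row)
        labels = flat.get('positionLables')
        if isinstance(labels, list):
            # Flatten the label list into one cell
            flat['positionLables'] = ';'.join(labels)
        writer.writerow(flat)
    return buf.getvalue()


# Made-up sample row for illustration
rows = [{'companyFullName': 'Example Co', 'city': 'Beijing',
         'salary': '10k-20k', 'positionLables': ['python', 'selenium']}]
print(to_csv_text(rows))
```

To write a file instead, the same DictWriter can be given a file opened with open(path, 'w', newline='', encoding='utf-8').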

As shown above, the Requests library makes it easy to fetch the job listings for a search keyword on Lagou. Of course, there is still a lot left to do.
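
As one example of the remaining work, the salary analysis mentioned at the start needs the salary strings turned into numbers. The sketch below assumes salaries look like "10k-20k"; the article does not confirm that format, so the pattern, the helper parse_salary, and the sample values are all assumptions.

```python
import re


def parse_salary(text):
    """Return (low, high) in thousands, or None if text does not match the assumed format."""
    match = re.fullmatch(r'\s*(\d+)\s*[kK]\s*-\s*(\d+)\s*[kK]\s*', text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Made-up sample values; real data may use other formats
samples = ['10k-20k', '15K-25K', 'negotiable']
ranges = [r for r in map(parse_salary, samples) if r is not None]

lowest = min(low for low, _ in ranges)
highest = max(high for _, high in ranges)
print(lowest, highest)
```

Values that do not match the pattern are skipped rather than guessed at, so the minimum and maximum only reflect entries that could be parsed.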
