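Analyzing Lagou Job Postings with Spark (Part 2): Getting the Data

What kind of data do we need?

The data we want is publicly available and easy to obtain. A complete dataset would of course be ideal, but one is rarely obtainable by reasonably legitimate means. For the real-time job postings this series studies, postings from the most recent month are enough.

How do we get the data?

A web crawler would work, but only as a fallback. Lagou loads its listing data through ajax requests, which makes bulk retrieval simpler than scraping pages. There are many ways to fetch data from an ajax endpoint; here I demonstrate the one I consider simplest and most general: using curl to replay the ajax request.

Note: the steps below are all demonstrated in Google Chrome on a Mac; some operations may differ slightly in other browsers. The last section gives the extracted, general-purpose curl script; if you are not interested in the individual steps, you can use it directly.

1. Find the target city and target position, then sort by "Newest". Reference link: http://www.lagou.com/jobs/list_iOS?px=new&city=北京#order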

0-0.png

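2. Two-finger tap or right-click the page, choose "Inspect" from the context menu to open the browser's developer tools, and switch to the XHR filter under the Network tab.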

0_1.png

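3. Press cmd + R to reload the page; the XHR requests the page sends will now be captured. Find the request that starts with http://www.lagou.com/jobs/positionAjax.json, two-finger tap or right-click it, and choose Copy as cURL.

The copied curl command is very long. For this analysis, the key part is the trailing pn=1&kd=iOS: pn is the page number and kd is the position keyword. Setting them dynamically lets you fetch more pages for more positions; the later sections of this article look at each in turn.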

curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data 'first=true&pn=1&kd=iOS' --compressed

0_2.png

4. Paste the curl command from the previous step into a terminal and press Enter to see the output.

{"success":true,"requestId":null,"msg":null,"resubmitToken":null,"content":{"pageNo":1,"pageSize":15,"positionResult":{"totalCount":974,"resultSize":15,"locationInfo":{"city":"北京","district":null,"queryByGisCode":false,"businessZone":null,"locationCode":null},"queryAnalysisInfo":{"positionName":"ios","companyName":null,"usefulCompany":false,"industryName":null},"strategyProperty":{"name":"dm-csearch-newSimScorer","id":1},"result":[{"companyId":129801,"companyShortName":"言之有物科技","createTime":"2016-08-30 19:28:12","positionId":1857486,"positionAdvantage":"一線公司,技術驅動,免費三餐,超期望回報","salary":"25k-50k","score":0,"workYear":"不限","education":"本科","city":"北京","positionName":"iOS高級研發工程師/Lead","companyLogo":"i/image/M00/43/4E/CgqKkVeDGsuAXz0gAAA4XeGAAHQ390.png","financeStage":"成長型(A輪)","industryField":"移動互聯網,電子商務","jobNature":"全職","approve":1,"companySize":"15-50人","district":null,"companyLabelList":["股票期權","扁平管理","美女多","領導好"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"19:28發佈","gradeDescription":null,"companyFullName":"北京言之有物科技有限公司","businessZones":null,"imState":"today","lastLogin":1472556472000,"publisherId":5092848,"explain":null,"plus":null,"pcShow":0},{"companyId":133,"companyShortName":"獵豹移動","createTime":"2016-08-30 19:09:34","positionId":2151896,"positionAdvantage":"明星產品 超讚年終獎 靠譜領導","salary":"15k-30k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"iOS","companyLogo":"image1/M00/39/70/CgYXBlWo3nqABJTsAADJ3hn5gmE062.jpg","financeStage":"上市公司","industryField":"移動互聯網,信息安全","jobNature":"全職","approve":1,"companySize":"500-2000人","district":"朝陽區","companyLabelList":["帶薪年假","美女前臺","超讚年終獎","一公里工作圈"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"19:09發佈","gradeDescription":null,"companyFullName":"北京金山網絡科技有限公司","businessZones":["姚家園","十里堡","高碑店"],"imState":"today","lastLogin":1472555392000,"publisherId":129969,"explain":null,"plus":null,"pcShow":0},{"companyId":107608,"companyShortName":"MUM計算機","createTime":"2016-08-30 
19:03:24","positionId":1963945,"positionAdvantage":"幫助程序員赴美做IT,享受高薪高品質生活","salary":"10k-20k","score":0,"workYear":"不限","education":"本科","city":"北京","positionName":"IOS程序員赴美項目推廣員","companyLogo":"i/image/M00/00/C2/CgqKkVZVHmSAWPtRAASUg0iUVuI932.jpg","financeStage":"初創型(不需要融資)","industryField":"教育","jobNature":"全職","approve":0,"companySize":"少於15人","district":"昌平區","companyLabelList":["赴美工作","美元薪水","告別996","技術前沿"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"19:03發佈","gradeDescription":null,"companyFullName":"北京瑪赫西計算機教育諮詢有限公司","businessZones":null,"imState":"disabled","lastLogin":1472558059000,"publisherId":5179699,"explain":null,"plus":null,"pcShow":0},{"companyId":67576,"companyShortName":"車滿滿","createTime":"2016-08-30 18:47:30","positionId":2307877,"positionAdvantage":"期權","salary":"20k-25k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS高級開發工程師","companyLogo":"i/image/M00/01/47/Cgp3O1ZmYACABBpPAAGzVR5S-Ps906.png","financeStage":"成長型(A輪)","industryField":"移動互聯網","jobNature":"全職","approve":1,"companySize":"50-150人","district":"朝陽區","companyLabelList":["股票期權","技能培訓","彈性工作","定期體檢"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:47發佈","gradeDescription":null,"companyFullName":"車滿滿(北京)信息技術有限公司","businessZones":["建外大街","CBD","國貿"],"imState":"today","lastLogin":1472566873000,"publisherId":2116322,"explain":null,"plus":null,"pcShow":0},{"companyId":1575,"companyShortName":"百度","createTime":"2016-08-30 18:30:05","positionId":2307765,"positionAdvantage":"BAT 
薪酬福利好","salary":"15k-25k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS移動開發","companyLogo":"image1/M00/00/06/CgYXBlTUWAWAOBXrAABGHHFb0q8748.jpg","financeStage":"上市公司","industryField":"移動互聯網,數據服務","jobNature":"全職","approve":1,"companySize":"2000人以上","district":null,"companyLabelList":["股票期權","彈性工作","五險一金","免費班車"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:30發佈","gradeDescription":null,"companyFullName":"百度在線網絡技術(北京)有限公司","businessZones":null,"imState":"disabled","lastLogin":1472553001000,"publisherId":5705515,"explain":null,"plus":null,"pcShow":0},{"companyId":13321,"companyShortName":"FunPlus 趣加遊戲","createTime":"2016-08-30 18:26:28","positionId":2240276,"positionAdvantage":"國際一線團隊,無限的成長空間,任你發揮","salary":"18k-36k","score":0,"workYear":"5-10年","education":"本科","city":"北京","positionName":"iOS 視頻處理工程師/高級工程師","companyLogo":"image1/M00/00/1A/Cgo8PFTUWFWAKE5aAABwJ1mgAYw423.png","financeStage":"成長型(B輪)","industryField":"遊戲","jobNature":"全職","approve":0,"companySize":"150-500人","district":"海淀區","companyLabelList":["績效獎金","股票期權","專項獎金","五險一金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:26發佈","gradeDescription":null,"companyFullName":"北京趣加科技有限公司","businessZones":["中關村","知春路","雙榆樹"],"imState":"today","lastLogin":1472552889000,"publisherId":285309,"explain":null,"plus":null,"pcShow":0},{"companyId":15111,"companyShortName":"聯拓天際","createTime":"2016-08-30 
18:22:12","positionId":2307696,"positionAdvantage":"與其在別處仰望,不如在這裏並肩","salary":"15k-25k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS","companyLogo":"image1/M00/00/1D/Cgo8PFTUWGGAZQdjAADRNZVO9fc470.jpg","financeStage":"成熟型(不需要融資)","industryField":"電子商務","jobNature":"全職","approve":1,"companySize":"500-2000人","district":null,"companyLabelList":["五險一金","午餐補助","定期體檢","技能培訓"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:22發佈","gradeDescription":null,"companyFullName":"北京聯拓天際電子商務有限公司","businessZones":null,"imState":"today","lastLogin":1472552392000,"publisherId":1595082,"explain":null,"plus":null,"pcShow":0},{"companyId":119049,"companyShortName":"優久科技","createTime":"2016-08-30 18:15:29","positionId":1853231,"positionAdvantage":"良好的工作環境、成長平臺和工作夥伴","salary":"10k-18k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"iOS","companyLogo":"i/image/M00/16/74/CgqKkVbvnVuAeC-YAAA_YSPyb5A166.jpg","financeStage":"初創型(天使輪)","industryField":"移動互聯網","jobNature":"全職","approve":0,"companySize":"少於15人","district":"海淀區","companyLabelList":["交通補助","通訊津貼","午餐補助"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:15發佈","gradeDescription":null,"companyFullName":"北京優久科技有限責任公司","businessZones":["中關村","知春路","人民大學"],"imState":"today","lastLogin":1472552013000,"publisherId":4427723,"explain":null,"plus":null,"pcShow":0},{"companyId":41878,"companyShortName":"商詢科技","createTime":"2016-08-30 
18:14:06","positionId":2278393,"positionAdvantage":"微軟創業團隊,工程師文化!","salary":"10k-15k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"iOS開發","companyLogo":"i/image/M00/24/22/Cgp3O1cZmpWAGslpAAA9MdgVNWU645.jpg","financeStage":"成長型(A輪)","industryField":"企業服務,數據服務","jobNature":"全職","approve":1,"companySize":"15-50人","district":"朝陽區","companyLabelList":["股票期權","人脈資源","辦公環境好","國際化團隊"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:14發佈","gradeDescription":null,"companyFullName":"北京商詢科技有限公司","businessZones":["姚家園"],"imState":"today","lastLogin":1472554153000,"publisherId":803257,"explain":null,"plus":null,"pcShow":0},{"companyId":5832,"companyShortName":"新浪微博","createTime":"2016-08-30 18:02:30","positionId":254885,"positionAdvantage":"億級別DAU,微博重點項目組","salary":"20k-40k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"新浪微博iOS客戶端研發工程師","companyLogo":"image1/M00/00/0D/CgYXBlTUWCCAdkhOAABNgyvZQag818.jpg","financeStage":"上市公司","industryField":"移動互聯網","jobNature":"全職","approve":0,"companySize":"2000人以上","district":"海淀區","companyLabelList":["年底雙薪","專項獎金","股票期權","五險一金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:02發佈","gradeDescription":null,"companyFullName":"微夢創科網絡科技(中國)有限公司","businessZones":["西北旺","馬連窪","上地"],"imState":"disabled","lastLogin":1472556144000,"publisherId":561302,"explain":null,"plus":null,"pcShow":0},{"companyId":48321,"companyShortName":"合廣衆","createTime":"2016-08-30 
18:00:40","positionId":2263615,"positionAdvantage":"老闆nice","salary":"10k-20k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS開發工程師","companyLogo":"i/image/M00/01/D6/CgqKkVZ496GAYypzAAAKATKLXuY379.png","financeStage":"初創型(天使輪)","industryField":"移動互聯網","jobNature":"全職","approve":0,"companySize":"50-150人","district":"海淀區","companyLabelList":["節日禮物","帶薪年假","績效獎金","崗位晉升"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:00發佈","gradeDescription":null,"companyFullName":"北京合廣衆文化發展有限公司","businessZones":["八里莊","定慧寺","四季青"],"imState":"today","lastLogin":1472550077000,"publisherId":3608518,"explain":null,"plus":null,"pcShow":0},{"companyId":38239,"companyShortName":"Keep","createTime":"2016-08-30 17:52:25","positionId":2076872,"positionAdvantage":"福利健全、北京工作居住證、C輪","salary":"25k-35k","score":0,"workYear":"5-10年","education":"本科","city":"北京","positionName":"iOS開發工程師","companyLogo":"image1/M00/0A/40/CgYXBlTun9KASqKdAAAs36QVurU409.png","financeStage":"成熟型(C輪)","industryField":"社交網絡,文化娛樂","jobNature":"全職","approve":1,"companySize":"150-500人","district":null,"companyLabelList":["節日禮物","年度旅遊","定期體檢","五險一金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:52發佈","gradeDescription":null,"companyFullName":"北京卡路里科技有限公司","businessZones":null,"imState":"today","lastLogin":1472550738000,"publisherId":3425178,"explain":null,"plus":null,"pcShow":0},{"companyId":179,"companyShortName":"她理財","createTime":"2016-08-30 17:52:02","positionId":982402,"positionAdvantage":"五險一金 績效獎金 年底15薪 
帶薪年假","salary":"15k-25k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"高級iOS開發工程師","companyLogo":"image1/M00/0C/F2/CgYXBlT2mG2AOPevAAB_09mD2Ko247.png","financeStage":"成長型(A輪)","industryField":"電子商務,金融","jobNature":"全職","approve":1,"companySize":"50-150人","district":"朝陽區","companyLabelList":["年底雙薪","節日禮物","技能培訓","績效獎金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:52發佈","gradeDescription":null,"companyFullName":"北京新工場投資顧問有限公司","businessZones":["大望路","華貿","百子灣"],"imState":"today","lastLogin":1472557005000,"publisherId":97147,"explain":null,"plus":null,"pcShow":0},{"companyId":11053,"companyShortName":"中科三方","createTime":"2016-08-30 17:33:13","positionId":2307276,"positionAdvantage":"留用機會,戶口指標","salary":"2k-4k","score":0,"workYear":"應屆畢業生","education":"本科","city":"北京","positionName":"iOS實習生","companyLogo":"image1/M00/00/16/CgYXBlTUWEWAXnWbAACvz96W4qA927.jpg","financeStage":"成長型(不需要融資)","industryField":"移動互聯網","jobNature":"實習","approve":0,"companySize":"150-500人","district":"海淀區","companyLabelList":null,"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:33發佈","gradeDescription":null,"companyFullName":"北京中科三方網絡技術有限公司","businessZones":["中關村","知春路","雙榆樹"],"imState":"today","lastLogin":1472549621000,"publisherId":141237,"explain":null,"plus":null,"pcShow":0},{"companyId":116183,"companyShortName":"情非得已","createTime":"2016-08-30 
17:28:11","positionId":1786957,"positionAdvantage":"五險一金、無限小吃、Mac辦公、定期體檢","salary":"8k-15k","score":0,"workYear":"1-3年","education":"不限","city":"北京","positionName":"android&iOS測試工程師","companyLogo":"i/image/M00/1C/58/CgqKkVcB1QyAJM2-AAA4t6tVzs8439.jpg","financeStage":"初創型(天使輪)","industryField":"移動互聯網,企業服務","jobNature":"全職","approve":0,"companySize":"15-50人","district":"朝陽區","companyLabelList":["定期體檢","年度旅遊","領導好","扁平管理"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:28發佈","gradeDescription":null,"companyFullName":"情非得已(北京)科技有限公司","businessZones":["建外大街","國貿","CBD"],"imState":"today","lastLogin":1472553855000,"publisherId":4170237,"explain":null,"plus":null,"pcShow":0}]}},"code":0}

As you can see, the output matches exactly what the first page of the site shows.
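
To make the role of pn and kd concrete, here is a minimal sketch that assembles the form body passed to --data in the captured command. The helper name build_post_data is hypothetical, and the sketch assumes first=true can stay fixed, as it does in the captured request.

```shell
# Hypothetical helper: build the form body that the captured command sends via --data.
# $1 = page number (pn), $2 = position keyword (kd).
build_post_data() {
  printf 'first=true&pn=%s&kd=%s' "$1" "$2"
}

# Same body as the captured request:
build_post_data 1 iOS
echo
# Page 3 of the Android listings:
build_post_data 3 Android
echo
```

Everything else in the command (cookies, headers, URL) stays the same between requests; only this body changes.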

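How do we save the data to a file?

Saving curl's result straight to a file makes further processing easier. To do so, use the redirection operator >. The command below writes curl's result to the file 1.json instead of printing it to the console: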

curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data 'first=true&pn=1&kd=iOS' --compressed > 1.json
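
Once the response is in a file, a quick sanity check is to pull one field out of each posting and compare it with the web page. The sketch below uses grep -o and cut for that. The helper name list_field is hypothetical, it only handles simple string-valued fields, and a real JSON parser would be more robust.

```shell
# Hypothetical helper: print every value of a string-valued JSON field, one per line.
# $1 = file, $2 = field name. Pattern-based, so only suitable for quick checks.
list_field() {
  grep -o "\"$2\":\"[^\"]*\"" "$1" | cut -d '"' -f 4
}

# Tiny stand-in for 1.json with the same field layout as the response above.
printf '%s' '{"result":[{"positionName":"iOS","salary":"15k-30k"},{"positionName":"iOS","salary":"15k-25k"}]}' > sample.json

list_field sample.json salary
```

Run against the real file, list_field 1.json salary should print one line per posting on the page.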

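How do we get data for other positions?

This takes slightly more shell syntax. In short, a for ... in loop iterates over a given list of positions, sets the kd value at the end of the earlier curl command on each pass, and writes the output to a file named after the position. Note that the single quotes around the argument after --data must be changed to double quotes; otherwise the variable is not expanded. The full code is below; add positions to the list as needed: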

for kd in "Java" "PHP" "C" "C++" "Android" "iOS"
do 
curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data "first=true&pn=1&kd=$kd" --compressed > $kd.json 
done  
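
One caveat with the keyword list: in an application/x-www-form-urlencoded body a literal + is decoded as a space, so kd=C++ may not reach the server as C++. curl's --data-urlencode option can encode the value for you. The sketch below instead shows the idea with a hypothetical encode_kd helper that escapes only a handful of characters (%, +, &, #, space); that set is an assumption about what job keywords contain, not a full encoder.

```shell
# Hypothetical helper: percent-encode the few reserved characters likely to
# appear in a job keyword. % is handled first so it is not double-encoded.
encode_kd() {
  printf '%s' "$1" | sed 's/%/%25/g; s/+/%2B/g; s/&/%26/g; s/#/%23/g; s/ /%20/g'
}

for kd in "Java" "C++" "iOS"; do
  # The encoded form goes into the request body; the raw keyword names the file.
  echo "kd=$(encode_kd "$kd") -> ${kd}.json"
done
```

In the loop above, this would mean passing "first=true&pn=1&kd=$(encode_kd "$kd")" to --data and quoting the output file name as "$kd.json".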

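How do we fetch pages in bulk?

The curl script so far fetches a single page per run. To fetch multiple pages, add another for loop. From observation, Lagou returns valid data for at most about 100 pages, so we loop from 1 to 100 and save each result in the form kd_pn.json (for example iOS_1.json):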

for (( pn=1; pn<=100; pn=pn+1 )); do
for kd in "Java" "PHP" "C" "C++" "Android" "iOS"; do 
curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data "first=true&pn=$pn&kd=$kd" --compressed > $kd\_$pn.json
done  
done
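
Rather than hard-coding 100, the page count can also be estimated from the first response, which reports totalCount and pageSize. A minimal sketch, assuming those two fields keep the layout shown in the sample output earlier; the helper name page_count is hypothetical.

```shell
# Hypothetical helper: derive the number of pages from a saved first-page response.
# $1 = file containing "totalCount":N and "pageSize":M as in the sample output.
page_count() {
  total=$(grep -o '"totalCount":[0-9]*' "$1" | head -n 1 | cut -d ':' -f 2)
  size=$(grep -o '"pageSize":[0-9]*' "$1" | head -n 1 | cut -d ':' -f 2)
  # Integer ceiling division: round up so a partial last page is counted.
  echo $(( (total + size - 1) / size ))
}

# Stand-in file using the numbers from the iOS response shown earlier.
printf '%s' '{"content":{"pageNo":1,"pageSize":15,"positionResult":{"totalCount":974}}}' > first_page.json

page_count first_page.json
```

For the sample response, 974 postings at 15 per page works out to 65 pages. The helper does no error handling, so it fails if either field is missing from the file.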

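How do we speed up fetching?

If you ran the script above, you will have seen that it is rather slow: the curl requests run synchronously, so each download must finish before the next begins. Appending & runs each request in the background so that several proceed at once, which speeds things up. Also note that one machine can only open a limited number of curl connections at a time; to avoid unnecessary interruptions, I added a very short sleep. The improved code is below.

Note: this code may get your IP blocked by Lagou. Unless you are in a hurry, use it with caution; of course, you could also rotate through several IPs.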

for (( pn=1; pn<=100; pn=pn+1 )); do
for kd in "Java" "PHP" "C" "C++" "Android" "iOS"; do 
curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data "first=true&pn=$pn&kd=$kd" --compressed > $kd\_$pn.json &
sleep 0.02
done  
done

Note: if the script hangs, press ctrl + c to exit; if it keeps aborting abnormally, try increasing the value after sleep.
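
An alternative to tuning sleep is to bound the number of simultaneous requests by launching one batch at a time and calling wait before starting the next. The sketch below shows only the control flow: fetch_one is a hypothetical stand-in that writes a placeholder file and should be replaced by the long curl command above, and jobs_demo is an arbitrary directory name.

```shell
mkdir -p jobs_demo

# Hypothetical stand-in for the real curl request; $1 = page, $2 = keyword.
fetch_one() {
  printf 'placeholder for %s page %s\n' "$2" "$1" > "jobs_demo/${2}_${1}.json"
}

for pn in 1 2 3; do
  for kd in "Java" "PHP" "iOS"; do
    fetch_one "$pn" "$kd" &   # run this keyword's request in the background
  done
  wait                        # block until the whole batch for this page is done
  echo "page $pn finished"
done
```

With this layout, at most one request per keyword is in flight at any moment, and the script does not exit while downloads are still running.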

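A more complete script

Here the data goes into a separate jobs directory to keep things organized. The complete dataset can be downloaded from the GitHub project linked in this article: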

mkdir -p jobs
for (( pn=1; pn<=100; pn=pn+1 )); do
for kd in "Java" "PHP" "C" "C++" "Android" "iOS"; do 
curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data "first=true&pn=$pn&kd=$kd" --compressed > jobs/$kd\_$pn.json &
sleep 0.02
done  
done

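You may also notice that some positions do not have 100 pages of valid data. Do those files need extra handling? No. A basic feature of big-data tools such as Spark is a reasonable degree of fault tolerance toward the dataset: a few abnormal records generally do not prevent the data from being imported, and once imported it can be analyzed directly. That is a topic for later; subsequent articles in this series cover it separately.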
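
If you would still like to know which downloads did not come back as a normal response (for example after an IP block), a small loop can list them. This sketch assumes, based on the sample output earlier, that a good response contains "success":true. What a blocked or out-of-range response actually looks like is not shown in this article, so treat the check as a heuristic. The helper name list_bad_files and the check_demo directory are hypothetical.

```shell
# Hypothetical helper: print the .json files in a directory that lack "success":true.
# $1 = directory to scan.
list_bad_files() {
  for f in "$1"/*.json; do
    [ -e "$f" ] || continue            # directory has no .json files
    grep -q '"success":true' "$f" || echo "$f"
  done
}

# Demo with two stand-in files: one good response, one empty failed download.
mkdir -p check_demo
printf '%s' '{"success":true,"code":0}' > check_demo/iOS_1.json
: > check_demo/iOS_2.json

list_bad_files check_demo
```

Against the real data, list_bad_files jobs would print the files worth re-downloading or inspecting by hand.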


GitHub repository for this series: https://github.com/ios122/spark_lagou
