用scrapy框架爬取拉勾網的全站招聘信息

## 文章開頭做個說明:拉勾網的反爬機制會利用 cookie 來識別 scrapy 爬蟲的身份,所以要把 settings 裏面 COOKIES_ENABLED = False 這一行的註釋打開,然後在全局請求頭裏加上自己登錄拉勾網後的 cookie 信息,程序就能運行起來了。

# Default headers attached to every request (goes in settings.py).
# The Cookie header carries a logged-in Lagou session so the site's
# anti-bot check passes; COOKIES_ENABLED = False must also be set so
# Scrapy sends this header verbatim instead of managing cookies itself.
# NOTE: the blog rendering mangled this dict (typographic quotes, and
# markdown ate the `*/*` in Accept); restored to valid Python here.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Cookie': '_ga=GA1.2.2099113890.1534936734; user_trace_token=20180822191855-287db68b-a5fd-11e8-9d3f-525400f775ce; LGUID=20180822191855-287dbc2a-a5fd-11e8-9d3f-525400f775ce; index_location_city=%E5%8C%97%E4%BA%AC; fromsite=”localhost:63342”; JSESSIONID=ABAAABAAAFCAAEGFD51330D304C813384D17099E6E0ABD4; _gid=GA1.2.233681856.1535974711; _gat=1; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1535588512,1535588697,1535589004,1535974711; TG-TRACK-CODE=index_navigation; SEARCH_ID=eaec941faa014acbbe5c2c65ebe8187c; LGSID=20180903193836-e4f5d61f-af6d-11e8-8570-525400f775ce; PRE_UTM=; PRE_HOST=; PRE_SITE=https%3A%2F%2Fwww.lagou.com%2F; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2Fqukuailian%2F; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1535974717; LGRID=20180903193838-e64d8ccc-af6d-11e8-8570-525400f775ce'
}


import scrapy
import re
from my_project.items import LagouItem
class LagouSpider(scrapy.Spider):
    """Crawl job postings site-wide from lagou.com.

    Flow: home page -> category listing pages -> job detail pages,
    following each listing's pagination via its "next page" link.
    Page parsing is regex-over-HTML, so every extraction guards
    against the pattern not matching (layout change / anti-bot page).
    """
    name = 'lagou'
    allowed_domains = ['lagou.com']
    start_urls = ['http://lagou.com/']

    def parse(self, response):
        # Home page: collect every category-listing URL. The anchors we
        # want are tagged with data-lg-tj-id="4O00".
        for href in re.findall(r'a href="(.*)" data-lg-tj-id="4O00"', response.text):
            yield scrapy.Request(url=href, callback=self.parse_details)

    def parse_details(self, response):
        # Listing page: collect all job-detail URLs.
        detail_urls = re.findall(r'link" href="(.*)" target="_blank" ', response.text)
        if detail_urls:
            # The last match is not a job posting; drop it. (Original
            # used `del res_detail[-1]`, which raised IndexError on an
            # empty result set.)
            detail_urls.pop()
        for url in detail_urls:
            yield scrapy.Request(url=url, callback=self.parse_info)

        # Pagination: the last anchor with class="page_no" is the
        # next-page link; it is absent on the final page, so test the
        # list instead of swallowing IndexError with a bare except.
        next_pages = re.findall(r'a href="(.*)" class="page_no"', response.text)
        if next_pages:
            yield scrapy.Request(url=next_pages[-1], callback=self.parse_details)

    def parse_info(self, response):
        # Detail page: scrape the fields we want. If any expected field
        # is missing, skip this posting instead of falling through to
        # the item assignments with unbound locals (the original's
        # `except: pass` here led to a NameError on the next line).
        try:
            lg_title = re.findall(r'span class="name">(.*)</', response.text)[0]
            lg_salary = re.findall(r'salary">(.*)</span', response.text)[0]
            lg_address = re.findall(r'span>/(.*) /</span', response.text)[0]
            lg_experience = re.findall(r'span>(.*) /</span', response.text)[1]
            lg_study = re.findall(r'pan>(.*) /</sp', response.text)[2]
            lg_create_time = re.findall(r'class="publish_time">(.*)&nbsp; 發佈於拉勾網</p>', response.text)[0]
        except IndexError:
            self.logger.warning('Unexpected page layout, skipping %s', response.url)
            return

        # Job description lives in structured markup; join the paragraph
        # text nodes into one string.
        lg_job_description = ''.join(
            response.xpath('//dd[@class="job_bt"]/div/p/text()').extract())

        item = LagouItem()
        item['lg_title'] = lg_title
        item['lg_salary'] = lg_salary
        item['lg_address'] = lg_address
        item['lg_experience'] = lg_experience
        item['lg_study'] = lg_study
        item['lg_create_time'] = lg_create_time
        item['lg_job_description'] = lg_job_description
        item['lagou'] = '拉勾網'

        yield item

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章