Scraping Lagou Again and Ignoring the Anti-Scraping Measures! Selenium + XPath + re: If You Can See It, You Can Scrape It

Original

2020-03-10 05:24

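I previously wrote a post, "Successfully Scraping Lagou with Python: A First Encounter with Anti-Scraping" (a beginner's real scraping journey, a bit on the long side). That was my first attempt at a site with several anti-scraping measures. Until then I had mostly scraped simple, fixed static pages to practice the basic Python scraping libraries, so I ran into a whole wave of setbacks and only barely got through them after reading a lot of blogs by more experienced people. By now, though, I have a good understanding of how Lagou loads its data (dynamically, via Ajax) and of its anti-scraping measures 😁

I recently learned Selenium, a browser automation and testing library, and wanted to scrape Lagou again with it to see how the experience differs. The appeal is that Selenium drives a real browser and simulates a person's clicks and typing, so there is no need to analyze the page's requests and responses, let alone construct request headers. Since so little analysis is needed, let's go straight to the code!

Import the required libraries: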

from selenium import webdriver
import time
import lxml
from lxml import etree
import re

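Drive the browser (navigation and clicks) from the main block. If this code is placed inside a function instead, the browser may open and load the page successfully but then quit almost immediately:

if __name__ == '__main__':
    url = 'https://www.lagou.com/'
    #login(url)
    # Initialize the browser
    driver = webdriver.Chrome()
    # Navigate to the target page
    driver.get(url)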

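Once the page loads, a city-selection dialog pops up and gets in the way of reading the page source. So I first looked up the close button in the source, then located it and simulated a click to dismiss the dialog:

# Get the button that closes the popup dialog
#<button type="button" id="cboxClose">close</button>
# Note: find_element_by_* was removed in Selenium 4.3; there, use driver.find_element('id', 'cboxClose')
button_close = driver.find_element_by_id('cboxClose')
# Close the popup dialog
button_close.click()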

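That closes the dialog. Next, get the search box, type in the keyword you want to search for, and click the search button:

Locate the button and the search box by their key attributes (here, their id values) as they appear in the browser's Elements panel:

# Wait 1 second for the page source to respond
time.sleep(1)
#keywords = input('Enter the job you want to search for: ')
# Named search_input so the built-in input() is not shadowed
search_input = driver.find_element_by_id('search_input')
search_input.send_keys('python网络爬虫')
button_search = driver.find_element_by_id('search_button')
button_search.click()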
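
The fixed `time.sleep(1)` assumes the page is ready within one second. A hedged alternative is to poll until the element actually shows up. Selenium ships `WebDriverWait` for this purpose; the standard-library sketch below only illustrates the idea. The helper name `wait_until` and its parameters are my own and do not come from the original post.

```python
import time

def wait_until(probe, timeout=10.0, interval=0.5):
    """Call probe() repeatedly until it returns a truthy value or time runs out.

    `probe`, `timeout` and `interval` are illustrative names (assumptions).
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} seconds'.format(timeout))
        time.sleep(interval)
```

With a live `driver`, a call such as `wait_until(lambda: driver.find_elements('id', 'search_input'))[0]` would return the search box once it exists, since `find_elements` returns an empty list while nothing matches.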

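That completes the keyword search. After the automated browser loaded the results, another unrelated popup appeared, so I inspected it the same way and closed it. Your experience may differ, and you may not get a popup at all, so I won't include a screenshot here. This is the code I used to close the popup and then grab the current page source: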

#<div class="body-btn">给也不要</div>
button_btn = driver.find_element_by_class_name('body-btn')
button_btn.click()
time.sleep(1)
page_source = driver.page_source
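
Because the popup may not appear for everyone, an unconditional `find_element_by_class_name('body-btn')` would raise when the element is missing. A minimal sketch of a more forgiving approach is to ask for all matches and click only if there is one. The helper name `click_if_present` is hypothetical; it only assumes the driver's `find_elements(by, value)` method.

```python
def click_if_present(driver, by, value):
    """Click the first element matching (by, value); return False if none exists.

    Relies on find_elements() returning an empty list instead of raising.
    The helper name is an assumption for illustration.
    """
    matches = driver.find_elements(by, value)
    if not matches:
        return False
    matches[0].click()
    return True

# Example call with a live driver ('class name' is the string behind By.CLASS_NAME):
# click_if_present(driver, 'class name', 'body-btn')
```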

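Finally, just extract the information you want. I used both re and XPath here to get more practice with each:

All of the information for each job sits inside div tags: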

def search_information(page_source):
    tree = etree.HTML(page_source)
    #<h3 style="max-width: 180px;">网络爬虫工程师</h3>
    position_name = tree.xpath('//h3[@style="max-width: 180px;"]/text()')
    #<span class="add">[<em>北京·小营</em>]</span>
    position_location = tree.xpath('//span[@class="add"]/em/text()')
    #<span class="format-time">17:15发布</span>
    position_report_time = tree.xpath('//span[@class="format-time"]/text()')
    #<span class="money">8k-15k</span>
    position_salary = tree.xpath('//span[@class="money"]/text()')
    #position_edution = tree.xpath('//div[@class="li_b_l"]/text()')
    position_edution = re.findall('<div.*?class="li_b_l">(.*?)</div>',str(page_source),re.S)
    position_result_edution = sub_edution(position_edution)
    position_company_name = tree.xpath('//div[@class="company_name"]/a/text()')
    position_company_href = tree.xpath('//div[@class="company_name"]/a/@href')
    position_company_industry = tree.xpath('//div[@class="industry"]/text()')
    position_company_industry_result = sub_industry(position_company_industry)
    #<div class="li_b_r">“免费早午餐+免费班车+五险两金+年终奖”</div>
    position_good = tree.xpath('//div[@class="li_b_r"]/text()')

    # Print one record per job; all lists are indexed by the same position
    for i in range(len(position_company_name)):
        print("Job title: {}".format(position_name[i]))
        print("Location: {}".format(position_location[i]))
        print("Posted: {}".format(position_report_time[i]))
        print("Salary: {}".format(position_salary[i]))
        print("Requirements: {}".format(position_result_edution[i]))
        print("Company name: {}".format(position_company_name[i]))
        print("Company size: {}".format(position_company_industry_result[i]))
        print("Benefits: {}".format(position_good[i]))
        print("Company link: {}".format(position_company_href[i]))
        print('-----------------------------------------------------------------------')
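
The loop indexes several parallel lists by position, which quietly assumes that every query returned exactly one value per job. If one job lacked, say, a benefits line, the lists would fall out of step or raise an `IndexError`. As a hedged sketch (the helper name `combine_fields` and the field names in the usage line are my own), the lists can be checked and merged into one record per job before printing:

```python
def combine_fields(**columns):
    """Merge equal-length lists into a list of dicts, one dict per job.

    Raises ValueError when the lists differ in length, which signals that
    some field was missing on the page. Names here are illustrative only.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('field lists differ in length: {}'.format(lengths))
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]

rows = combine_fields(name=['Crawler engineer', 'Data engineer'],
                      salary=['8k-15k', '10k-20k'])
for row in rows:
    print(row)
```

Failing loudly on a length mismatch is a design choice: it is usually easier to debug than records that silently pair one job's salary with another job's company.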

Clean up the whitespace characters in the content returned by the regular expression:

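def sub_edution(items):
    # `items` replaces the original parameter name `list`, which shadowed the built-in
    cleaned = []
    for item in items:
        one = re.sub('\n', '', item)                    # drop newlines
        two = re.sub(' <span.*?>.*?</span>', '', one)   # drop <span> elements
        three = re.sub('<!--<i></i>-->', '', two)       # drop the HTML comment
        cleaned.append(three)
    # Keep only every other entry (indices 0, 2, 4, ...)
    return cleaned[::2]

def sub_industry(items):
    # Remove newlines from each entry
    return [re.sub('\n', '', item) for item in items]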
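
To make the cleanup concrete, here is a small self-contained sketch that also collapses the leftover indentation. The sample fragment is a guess at what one `li_b_l` match might look like, modeled on the snippets quoted in the comments above, and `clean_fragment` is a hypothetical helper rather than part of the original script.

```python
import re

def clean_fragment(fragment):
    """Strip <span> elements, HTML comments, and extra whitespace from a fragment."""
    text = re.sub(r'<span.*?>.*?</span>', '', fragment, flags=re.S)
    text = re.sub(r'<!--.*?-->', '', text, flags=re.S)
    # split() with no arguments drops newlines and runs of spaces
    return ' '.join(text.split())

# Assumed sample, not copied from the live site
sample = '\n    <span class="money">8k-15k</span>\n    <!--<i></i>-->经验1-3年 / 本科\n'
print(clean_fragment(sample))
```

Here the salary span and the comment are removed, leaving only the experience and education text with single spaces.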

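The final output printed in the console:
Here I only scraped a single page out of 30 pages of results. You can try scraping multiple pages yourselves. It is quite simple: just look at how the URL differs from one page to the next.

Thanks for reading 😊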
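
A rough sketch of that multi-page idea follows. The post does not give Lagou's actual page URL pattern, so `url_template` is a placeholder you would fill in after comparing the URLs yourself; `crawl_pages`, `parse`, and `delay` are likewise names I made up for illustration.

```python
import time

def crawl_pages(driver, url_template, page_count, parse, delay=1.0):
    """Visit pages 1..page_count and collect parse(page_source) for each.

    `url_template` must contain a {page} placeholder. The real URL pattern
    is not given in the post, so any template passed in is an assumption.
    """
    results = []
    for page in range(1, page_count + 1):
        driver.get(url_template.format(page=page))
        time.sleep(delay)  # crude pause so the page can finish loading
        results.append(parse(driver.page_source))
    return results
```

`parse` can be any function that takes the page source, for example a variant of `search_information` that returns its rows instead of printing them.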
