[Python Industry Analysis] Collecting BOSS Zhipin Job Listings: Scraping with WebDriver

Preface

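There are many ways to scrape data from web pages. Previously I used the requests module with the browser's cookies attached to fetch page data. So could we drive a browser directly to get the data instead? Of course we can.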

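BOSS Zhipin also restricts this kind of automated testing software, but this approach can still collect more data than the previous version of the crawler. BOSS's anti-scraping strategy: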

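The first few automated crawls trigger a liveness verification prompt
If you keep crawling, access gets restricted, although this does not affect the account you are logged in with
Password login is also blocked by the policy at this point, so you can only log in by scanning the QR code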

Modules that can drive a browser

WebDriver

Import the browser driver, start the browser, and load a page with the get method, for example:

import time
from selenium import webdriver

def mac():
    driver = webdriver.Chrome()  # open the Chrome browser
    # driver = webdriver.Firefox()  # open the Firefox browser
    # driver = webdriver.Ie()  # open the IE browser
    driver.implicitly_wait(5)
    driver.get("http://www.baidu.com")

chromedriver.exe can be downloaded from http://chromedriver.storage.googleapis.com/index.html
When choosing a version, check the version of the browser you have installed; the two need to match.
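
The version check can be sketched in a few lines. This is only a rough illustration: the helper names major_version and versions_match are made up for this example, the sample version strings are hypothetical, and it assumes that comparing major version numbers is a good enough compatibility test.

```python
import re


def major_version(version_text):
    """Return the leading major version number found in a version string."""
    match = re.search(r"(\d+)\.\d+", version_text)
    if match is None:
        raise ValueError("no version number in %r" % version_text)
    return int(match.group(1))


def versions_match(browser_version, driver_version):
    """Assumption: equal major versions means the driver fits the browser."""
    return major_version(browser_version) == major_version(driver_version)


if __name__ == "__main__":
    # Hypothetical sample strings in the shape printed by `--version`
    print(versions_match("Google Chrome 83.0.4103.61", "ChromeDriver 83.0.4103.39"))
```

The two strings would normally come from the browser's "About" page and from running chromedriver with --version.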

webbrowser

Open the browser by importing webbrowser from the Python standard library, for example:

import webbrowser
# webbrowser.open takes a URL and opens it in the system default browser
webbrowser.open("http://www.baidu.com")

Splinter

Splinter depends on three packages: Cython, lxml, and selenium, so install them before installing Splinter. The links are given below:
1) http://download.csdn.net/detail/feisan/4301293
2) http://code.google.com/p/pythonxy/wiki/AdditionalPlugins#Installation_no
3) http://pypi.python.org/pypi/selenium/2.25.0#downloads
4) http://splinter.cobrateam.info/

#coding=utf-8
import time
from splinter import Browser

def splinter(url):
    browser = Browser()
    # open the 126 mail login page
    browser.visit(url)
    # wait for the page elements to load
    time.sleep(5)
    # fill in the account and password
    browser.find_by_id('idInput').fill('xxxxxx')
    browser.find_by_id('pwdInput').fill('xxxxx')
    # click the login button
    browser.find_by_id('loginBtn').click()
    time.sleep(8)
    # close the browser window
    browser.quit()

if __name__ == '__main__':
    website = 'http://www.126.com'
    splinter(website)

This time we use WebDriver

Common WebDriver methods

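get requests a web page
find_element looks up an element
find_element_by_xx looks up by id, name, class_name, and so on (Selenium 4 deprecated and later removed these helpers in favour of find_element(By.XX, value))
send_keys types input
click clicks buttons, links, and so on
text returns the element's text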
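
As a rough illustration of how these methods fit together, here is a small sketch. The function name search_jobs is made up for this example. The element names ("query", "btn-search", "job-primary") are the ones used by the crawler in this post, and the locator strings "name" and "class name" are the values behind Selenium's By.NAME and By.CLASS_NAME. It assumes the driver has an implicit wait set, so the last lookup waits for the results page.

```python
def search_jobs(driver, keyword):
    """Type a keyword, submit the search, and return the first job card's text.

    `driver` is expected to be a Selenium WebDriver that has already
    loaded the site's home page with driver.get(...).
    """
    # send_keys: type into the input whose name attribute is "query"
    driver.find_element("name", "query").send_keys(keyword)
    # click: press the search button
    driver.find_element("class name", "btn-search").click()
    # text: read the visible text of the first job card on the results page
    return driver.find_element("class name", "job-primary").text
```

With a real browser it would be called as search_jobs(driver, "python") after driver.implicitly_wait(5) and driver.get("https://www.zhipin.com"). Because the function only relies on find_element, send_keys, click, and text, it can also be exercised with a stand-in object when no browser is available.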

The crawler program

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import time
HOST = "https://www.zhipin.com"


def extract_data(content):
    # Parse the job list on the page
    job_list = []
    for item in content.find_all(class_="job-primary"):
        job_title = item.find("div", attrs={"class": "job-title"})
        job_name = job_title.a.attrs["title"]
        job_href = job_title.a.attrs["href"]
        data_jid = job_title.a['data-jid']
        data_lid = job_title.a["data-lid"]
        job_area = job_title.find(class_="job-area").text

        job_limit = item.find(class_="job-limit")
        salary = job_limit.span.text
        exp = job_limit.p.contents[0]
        degree = job_limit.p.contents[2]

        company = item.find(class_="info-company")
        company_name = company.h3.a.text
        company_type = company.p.a.text

        stage = company.p.contents[2]
        scale = company.p.contents[4]
        info_desc = item.find(class_="info-desc").text
        tags = [t.text for t in item.find_all(class_="tag-item")]
        job_list.append([job_area, company_type, company_name, data_jid, data_lid, job_name, stage, scale, job_href,
                         salary, exp, degree, info_desc, "、".join(tags)])
    return job_list


def get_job_text(driver):
    """
    The page data can also be read through each element's text,
    but that is less convenient to parse.
    """
    time.sleep(5)
    job_list = driver.find_elements_by_class_name("job-primary")
    for job in job_list:
        print(job.text)

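def main(host):
    # Written against the Selenium 3 API; Selenium 4 deprecated and later
    # removed executable_path and the find_element_by_* helpers
    chrome_driver = "chromedriver.exe"
    driver = webdriver.Chrome(executable_path=chrome_driver)
    driver.get(host)
    # Locate the search box and type the query
    driver.find_element_by_name("query").send_keys("python")
    # Click the search button
    driver.find_element_by_class_name("btn-search").click()

    job_list = []
    while True:
        # Give the page time to load before reading it
        time.sleep(5)
        content = driver.execute_script("return document.documentElement.outerHTML")
        content = bs(content, "html.parser")
        job_list += extract_data(content)
        # Keep paging while a "next" link is present
        next_page = content.find(class_="next")
        if next_page:
            driver.find_element_by_class_name("next").click()
        else:
            break

    # quit() ends the whole browser session, not just the current window
    driver.quit()
    return job_list


if __name__ == "__main__":
    main(HOST)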
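
The program above accumulates rows in job_list but does not store them anywhere. One possible way to persist them is to write CSV with the standard library, as in the sketch below. The names COLUMNS and write_jobs are made up for this example, the column order simply mirrors the list built in extract_data, and the sample row is invented for illustration, not real scraped data.

```python
import csv
import io

# Column order mirrors the list appended in extract_data
COLUMNS = ["job_area", "company_type", "company_name", "data_jid", "data_lid",
           "job_name", "stage", "scale", "job_href", "salary", "exp", "degree",
           "info_desc", "tags"]


def write_jobs(job_list, stream):
    """Write a header row followed by one row per job to a text stream."""
    writer = csv.writer(stream)
    writer.writerow(COLUMNS)
    writer.writerows(job_list)
    return len(job_list)


if __name__ == "__main__":
    # Made-up sample row for illustration only
    sample = [["Beijing", "Internet", "Example Co", "jid1", "lid1",
               "Python Developer", "Series A", "100-499",
               "/job_detail/example.html", "15-25K", "3-5 years",
               "Bachelor", "Example description", "Python、Django"]]
    buffer = io.StringIO()
    write_jobs(sample, buffer)
    print(buffer.getvalue())
```

To write to disk instead, the stream could be a file opened with open("jobs.csv", "w", newline="", encoding="utf-8-sig"); the utf-8-sig encoding is a choice made here so that spreadsheet programs detect the Chinese text correctly.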

That's it for today. I have started a WeChat public account, so please follow it, folks.
