[Python] Building a Zhaopin (智聯招聘) job crawler with multiprocessing + multithreading, writing to CSV + MongoDB

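Preparation:

This crawler uses only packages that ship with Python, so the only thing to install is pymongo, which is used to connect to the MongoDB database.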

pip install pymongo

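Step 1: Analyzing the target site

url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?p=0&jl=%E5%85%A8%E5%9B%BD&kw=%E5%A4%A7%E6%95%B0%E6%8D%AE&isadv=0'

Looking at the URL above, Zhaopin defines the search scope with four parameters. p is the page number. jl is the region, here the whole country. kw is the search keyword, here "big data". I did not look into isadv, but it has no effect on the search.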
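As a quick check of that reading, the query string can be decoded with the standard library. This is only an illustrative sketch; `params` is a name introduced here and is not part of the crawler.

```python
from urllib.parse import urlparse, parse_qs

url = ('http://sou.zhaopin.com/jobs/searchresult.ashx'
       '?p=0&jl=%E5%85%A8%E5%9B%BD&kw=%E5%A4%A7%E6%95%B0%E6%8D%AE&isadv=0')

# parse_qs percent-decodes each value and returns a dict of lists
params = parse_qs(urlparse(url).query)
for name, values in params.items():
    print(name, '=', values[0])
```

The decoded jl and kw values are the region and keyword that were typed into the search box.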

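So we can build the URL of any index page by customizing these search parameters:

from urllib.parse import urlencode

def get_page_index(jl,keyword,page=1):
    # Query parameters: region, keyword, page number
    data = {
    'jl':jl,
    'kw':keyword,
    'p':page,
    'isadv': 0
    }
    # urlencode percent-encodes the non-ASCII region and keyword
    url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?'+ urlencode(data)
    print('[+] Reached index page: %s' % url)
    return download_html(url)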

Next, create a config script and import it into the 爬取智聯招聘.py file.

from config import *
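The post does not show the contents of config.py. A minimal sketch might look like the following. The names are the ones the crawler code refers to (MONGO_URL, MONGODB, MONGOTABE, jl, keyword); every value is an assumption chosen for illustration.

```python
# config.py -- hypothetical values; only the names come from the crawler code

# MongoDB connection settings (assumed local instance, made-up database and collection names)
MONGO_URL = 'mongodb://localhost:27017'
MONGODB = 'zhaopin'
MONGOTABE = 'bigdata_jobs'

# Search parameters matching the example URL: region and keyword
jl = '全国'
keyword = '大数据'
```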

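Step 2: Extracting the URL of every job on the index page

Each job on Zhaopin has its own URL.

Examples: http://jobs.zhaopin.com/208149611251243.htm

http://jobs.zhaopin.com/407340480250023.htm

http://jobs.zhaopin.com/390314230250020.htm

Looking at the job URLs above, each one consists of http://jobs.zhaopin.com/ + 15 digits + .htm

Using this, we can extract every link on the page with the re module and then keep only the ones that match the form shown in the examples.

The code is as follows: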

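import re

def get_index_href(html):
    # Collect the href of every <a> tag on the index page
    pattern = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    links = pattern.findall(html)
    # Job pages look like http://jobs.zhaopin.com/<15 digits>.htm (dots escaped)
    href = r'http://jobs\.zhaopin\.com/\d{15}\.htm'
    for link in links:
        if re.match(href, link):  # keep only job-page links
            yield link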
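To see the two-stage filter at work, here is a small sketch run against a made-up fragment of HTML. The `sample` string and the names `LINK`, `JOB_URL` and `job_links` are assumptions for illustration, not taken from the real site.

```python
import re

# Stage 1: every href inside an <a> tag; stage 2: only job-page URLs
LINK = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
JOB_URL = re.compile(r'http://jobs\.zhaopin\.com/\d{15}\.htm')

# Made-up index-page fragment: one job link and one unrelated link
sample = ('<a class="x" href="http://jobs.zhaopin.com/208149611251243.htm">Job</a>'
          '<a href="http://www.zhaopin.com/about.htm">About</a>')

job_links = [link for link in LINK.findall(sample) if JOB_URL.match(link)]
print(job_links)
```

Only the link with the 15-digit job id survives the second stage.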

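Step 3: Visiting each job URL and scraping the job details

The information we need is the set of fields on each job page: job title, monthly salary, work location, posting date, job type, work experience, minimum education, number of openings, job category, company size, company type and company industry.

Looking at the page source, Zhaopin is quite friendly to crawlers: every piece of content we need sits under a <strong> tag, so it is easy to capture all of it with regular expressions. An online regular expression tester is useful for working out the patterns.

The expressions I use for matching:

pattern = r'<li><span>[\u4e00-\u9fa5 ：]+</span>.*?>([\u4e00-\u9fa5 :\w\s/-]+).*>'

tittle_pattern = r'<h1>([\u4e00-\u9fa5 :+、()\.\w\s/-]+.*?)</h1>'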
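The sketch below applies both expressions to a made-up fragment shaped like the structure described above (a label in a span followed by a value in a strong tag). The `sample` markup is an assumption; the real page may differ in detail.

```python
import re

pattern = r'<li><span>[\u4e00-\u9fa5 ：]+</span>.*?>([\u4e00-\u9fa5 :\w\s/-]+).*>'
tittle_pattern = r'<h1>([\u4e00-\u9fa5 :+、()\.\w\s/-]+.*?)</h1>'

# Made-up job-page fragment: a title and two labelled fields
sample = '''<h1>大数据开发工程师</h1>
<li><span>职位月薪：</span><strong>8001-10000元/月</strong></li>
<li><span>工作地点：</span><strong>北京</strong></li>'''

title = re.search(tittle_pattern, sample).group(1)  # text inside <h1>
fields = re.findall(pattern, sample)                # one value per <li>
print(title, fields)
```

findall returns the values in page order, which is why the crawler can later pair them with a fixed list of column names by position.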

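Step 4: Looping

Looping is easy to understand and work out in a crawler, so I will not cover it here. Finally, we write the scraped data to MongoDB and to a CSV file for later use in machine learning and data analysis.

An example of writing a CSV file: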

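import csv
# newline="" stops the csv module from writing blank lines on Windows
with open("name.csv", "w", newline="") as csvfile:
    # Column names
    fileheader = ["name", "score"]
    dict_writer = csv.DictWriter(csvfile, fileheader)
    # Write the header row by hand (equivalent to dict_writer.writeheader())
    dict_writer.writerow(dict(zip(fileheader, fileheader)))
    # Then write each record as a dict of {field: value}
    dict_writer.writerow({"name": "Li", "score": "80"})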
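For the MongoDB side, a comparable minimal sketch is shown here. It assumes a pymongo collection object is passed in; `save_record` is a hypothetical helper name, and the connection string and database names in the usage comment are made up. `insert_one` is the current pymongo call for inserting a single document.

```python
def save_record(collection, record):
    """Insert one job record; return True if the server acknowledged it."""
    # Copy first: insert_one adds an '_id' key to the dict it is given,
    # which would otherwise leak into a later csv.DictWriter.writerow call.
    result = collection.insert_one(dict(record))
    return result.acknowledged

# Usage with a real server (assumed connection string and names):
#   import pymongo
#   client = pymongo.MongoClient('mongodb://localhost:27017')
#   save_record(client['zhaopin']['bigdata_jobs'], {'name': 'Li', 'score': '80'})
```

Passing the collection in as an argument keeps the helper free of global state, which matters once several worker processes are involved.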

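Step 5: Adding multithreading and multiprocessing

When writing a module, the multiprocessing and multithreading code generally goes after if __name__ == '__main__'. Comparing before and after: because of Python's global interpreter lock (GIL), multithreading does not speed up CPU-bound work much (it can still overlap time spent waiting on the network), whereas multiprocessing on a multi-core processor does improve throughput considerably. Both are simple to use in Python.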

    # Multiprocessing: a pool of worker processes
    pool = multiprocessing.Pool()
    # Multithreading: one thread drives pool.map over pages 1-99
    thread = threading.Thread(target=pool.map, args=(main, list(range(1, 100))))
    thread.start()
    thread.join()
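Since downloading pages is mostly waiting on the network, a pool of threads can overlap those waits even with the GIL. The following is a minimal sketch using concurrent.futures from the standard library. `fake_download` is a hypothetical stand-in for a real download function, and this arrangement is an alternative for comparison, not the one used in this post.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(page):
    # Stand-in for a network request: sleeping releases the GIL like real I/O
    time.sleep(0.1)
    return 'page %d' % page

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    # map returns results in the same order as the inputs
    results = list(executor.map(fake_download, range(8)))
elapsed = time.perf_counter() - start
print(results[0], 'elapsed: %.2fs' % elapsed)
```

With eight workers the eight simulated downloads run side by side, so the total time should be close to one sleep rather than eight.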

Full code:

import csv
import re
import threading
import multiprocessing
from urllib.parse import urlencode
from urllib.request import urlopen, Request
from urllib.error import URLError, HTTPError

import pymongo
from config import *

def download_html(url):
    # Browser-like User-Agent; the value must not repeat the header name
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}
    request = Request(url, headers=headers)
    try:
        html = urlopen(request).read().decode()
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it is caught first
        print('[W] Server error while downloading: %s' % e.reason)
        return None
    except URLError as e:
        print('[E] Site unreachable: %s' % e.reason)
        return None
    return html

def get_page_index(jl,keyword,page=1):
    data = {
    'jl':jl,
    'kw':keyword,
    'p':page,
    'isadv': 0
    }
    url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?'+ urlencode(data)
    print('[+] Reached index page: %s' % url)
    return download_html(url)

def get_index_href(html):
    # Collect the href of every <a> tag on the index page
    pattern = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    links = pattern.findall(html)
    # Job pages look like http://jobs.zhaopin.com/<15 digits>.htm (dots escaped)
    href = r'http://jobs\.zhaopin\.com/\d{15}\.htm'
    for link in links:
        if re.match(href, link):  # keep only job-page links
            yield link

def page_parser(html,fileheader):
    # One captured value per <li><span>label</span>...value... entry
    pattern = r'<li><span>[\u4e00-\u9fa5 ：]+</span>.*?>([\u4e00-\u9fa5 :\w\s/-]+).*>'
    data = re.findall(pattern,html)

    tittle_pattern = r'<h1>([\u4e00-\u9fa5 :+、()\.\w\s/-]+.*?)</h1>'
    tittle_find = re.search(tittle_pattern, html)
    if tittle_find:
        data.insert(0, tittle_find.group(1))
    else:
        print('Job title not found')
        # Fall back to a generic title ("big data position")
        data.insert(0,'大數據職位')
    # Pair column names with values by position
    clear_data = dict(zip(fileheader,data))
    return clear_data

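def writer_to_mongodb(res):
    # Open the client inside the worker: db is created under the __main__ guard
    # and is not available in worker processes started with spawn
    collection = pymongo.MongoClient(MONGO_URL)[MONGODB][MONGOTABE]
    # insert_one replaces the old insert(); pass a copy because it adds an '_id' key
    result = collection.insert_one(dict(res))
    if result.acknowledged:
        print('record {} save success'.format(result.inserted_id))
        return True
    print('save fail')
    return False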

def write_csv_header(fileheader):
    with open("智聯招聘.csv", "w",newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fileheader)
        writer.writeheader()

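def main(page):
    # CSV columns, labelled as on the site: job title, monthly salary, location, posting date, ...
    fileheader = ['職位名稱','職位月薪', '工作地點', '發佈日期', '工作性質', '工作經驗', '最低學歷', '招聘人數', '職位類別', '公司規模', '公司性質', '公司行業']
    html = get_page_index(jl,keyword,page)
    if html is None:  # index page failed to download
        return
    for link in get_index_href(html):
        print('[+] Found target page: %s' % link)
        parser_html = download_html(link)
        if parser_html is None:  # skip job pages that failed to download
            continue
        record = page_parser(parser_html,fileheader)

        # Write to MongoDB
        #writer_to_mongodb(record)

        with open("智聯招聘.csv", "a",newline='') as csvfile:
            print('    Writing to CSV file.....')
            writer = csv.DictWriter(csvfile, fieldnames=fileheader)
            writer.writerow(record)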

if __name__ == '__main__':
    client = pymongo.MongoClient(MONGO_URL)
    db = client[MONGODB]
    fileheader = ['職位名稱','職位月薪', '工作地點', '發佈日期', '工作性質', '工作經驗', '最低學歷', '招聘人數', '職位類別', '公司規模', '公司性質', '公司行業']
    write_csv_header(fileheader)
    # Multiprocessing: a pool of worker processes
    pool = multiprocessing.Pool()
    # Multithreading: one thread drives pool.map over pages 1-99
    thread = threading.Thread(target=pool.map, args=(main, list(range(1, 100))))
    thread.start()
    thread.join()

Results:
