Python Web Scraping: etree and XPath in Practice

Original

2018-08-31 13:30

The code below is an example I found online; step through it yourself when you have some spare time.

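# -*- coding:utf-8 -*-
""" Crawler: scrape startup company information from 創業邦 (cyzone.cn)
Page url = 'http://www.cyzone.cn/vcompany/list-0-0-1-0-0/0'
Scrapes each startup's name, funding stage, sector, founding date, and link from the page.
Uses the requests, json, codecs, and lxml libraries:
requests fetches the page and returns its source
json serializes the result so it can be saved to a local JSON file
codecs handles encoding when reading and writing files
lxml parses the page source and extracts the information
"""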
import requests
import json
import codecs
from lxml import etree
 
 
class chuangYeBang:
    def __init__(self):
        pass
 
    def get_html(self, url):
        """ get_html
        Fetch the page source and return it as Unicode text.

        @param: url
        @return: r.text (Unicode string)
        """
        # Adjacent string literals are concatenated, so each piece ends with
        # a space to keep the User-Agent tokens separated.
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6721.400 "
            "QQBrowser/10.2.2243.400"
        }
        r = requests.get(url, headers=headers)
        print(r.status_code)
        return r.text
 
    def get_company_data(self, text):
        """ get_company_data
        Extract the company information from the page.
        eg: [{
            "companyUrl": "http://www.cyzone.cn/r/20180824/68979.html",
            "stage": "天使輪",
            "type": "硬件",
            "time": "2014-12-19",
            "companyName": "成都思科"
        }]
        @param: text the page source as Unicode text
        @return: list all companies on one page; each element is a dict holding one company's fields
        """
        html = etree.HTML(text)  # parse the page
        company_name_list = html.xpath(
            '//td[@class="table-company-tit"]/a/span/text()'
            )
        # Text of the span inside the a inside each td whose class is
        # "table-company-tit"; xpath() returns a list
        print(company_name_list)  # get companyName list
        print(len(company_name_list))
 
        company_url_list = html.xpath(
            '//td[@class="table-company-tit"]/a/@href'
            )
        # Value of the href attribute of the a inside the same td; each
        # entry is a URL, e.g.
        # <a href="http://www.cyzone.cn/r/20180823/68963.html" target="_blank">
        print(company_url_list)
 
        stage_list = html.xpath('//td[@class="table-stage"]/@data-stage')
        # Same pattern: read the data-stage attribute, then strip commas
        # from both ends; an empty attribute becomes None
        company_stage_list = []
        for company_stage in stage_list:
            company_stage = company_stage.strip(',') if company_stage else None
            company_stage_list.append(company_stage)
        print(company_stage_list)  # get stage list
        print(len(company_stage_list))
 
        company_type_list = html.xpath('//td[@class="table-type"]')
        type_list = []
        for company_type in company_type_list:
            # Relative XPath from this td; None when the cell has no link text
            type_text = company_type.xpath('./a/text()')
            type_list.append(type_text[0] if type_text else None)
        print(type_list)
        print(len(type_list))
 
        company_time_list = html.xpath('//td[@class="table-time"]/text()')
        print(company_time_list)
        print(len(company_time_list))
 
        # Walk the lists in parallel and build one dictionary per company.
        # zip() stops at the shortest list, so a page with fewer than 20
        # rows does not raise IndexError.
        ret_company_list = []
        for company_url, name, company_type, stage, founded in zip(
                company_url_list, company_name_list, type_list,
                company_stage_list, company_time_list):
            single_company = {}
            single_company['companyUrl'] = company_url
            single_company['companyName'] = name
            single_company['type'] = company_type
            single_company['stage'] = stage
            single_company['time'] = founded
            ret_company_list.append(single_company)
 
        return ret_company_list
 
    def write_in_json(self, data):
        """ write_in_json
        Write the data to a JSON file.
        codecs.open  # opens the file with one explicit encoding (utf-8)
        json.dumps  # converts a dict or list to a JSON-formatted string
        indent=2  # pretty-prints the JSON with 2-space indentation
        ensure_ascii=False  # keeps Chinese text readable instead of escaping it
        """
        json_data = json.dumps(data, indent=2, ensure_ascii=False)
        with codecs.open('./chuangYeBang.json', 'w', 'utf-8') as fw:
            fw.write(json_data)
 
 
class getCompanyInfo:
    """ Fetch the detailed information for each company (not implemented yet) """
    def __init__(self):
        pass
 
    def get_html_text(self, url):
        headers = {}
        r = requests.get(url, headers=headers)
        print(r.status_code)
        return r.text
 
    def get_company_info(self, text):
        pass
 
 
if __name__ == "__main__":
    cyb = chuangYeBang()
    url = 'http://www.cyzone.cn/vcompany/list-0-0-1-0-0/0'
    text = cyb.get_html(url)
    data = cyb.get_company_data(text)
    cyb.write_in_json(data)
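
To try the XPath expressions without hitting the live site, the sketch below runs them against a small inline HTML fragment. The markup is a hypothetical stand-in modelled on the class names used in the script (`table-company-tit`, `table-stage`, `table-type`, `table-time`); the real page may well differ. It also walks the table one row at a time with relative XPath, so each company's fields stay together even when a cell is empty (with separate page-wide lists, `text()` returns nothing for an empty cell and the lists can drift out of step). The helper names `first` and `parse_rows` are assumptions for illustration, not part of the original script.

```python
from lxml import etree

# Hypothetical markup modelled on the class names used in the script above;
# the real page structure may differ.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="table-company-tit">
      <a href="http://example.com/a.html"><span>Alpha</span></a>
    </td>
    <td class="table-stage" data-stage="Angel,"></td>
    <td class="table-type"><a>Hardware</a></td>
    <td class="table-time">2014-12-19</td>
  </tr>
  <tr>
    <td class="table-company-tit">
      <a href="http://example.com/b.html"><span>Beta</span></a>
    </td>
    <td class="table-stage" data-stage=""></td>
    <td class="table-type"></td>
    <td class="table-time">2016-05-01</td>
  </tr>
</table>
"""


def first(nodes):
    """Return the first XPath result, or None when the result list is empty."""
    return nodes[0] if nodes else None


def parse_rows(text):
    """Parse one row at a time using XPath relative to each <tr>."""
    html = etree.HTML(text)
    companies = []
    # Select only rows that contain a company-name cell.
    for row in html.xpath('//tr[td[@class="table-company-tit"]]'):
        name_cell = './td[@class="table-company-tit"]'
        stage = first(row.xpath('./td[@class="table-stage"]/@data-stage'))
        companies.append({
            "companyName": first(row.xpath(name_cell + '/a/span/text()')),
            "companyUrl": first(row.xpath(name_cell + '/a/@href')),
            "stage": stage.strip(",") if stage else None,
            "type": first(row.xpath('./td[@class="table-type"]/a/text()')),
            "time": first(row.xpath('./td[@class="table-time"]/text()')),
        })
    return companies


companies = parse_rows(SAMPLE_HTML)
for company in companies:
    print(company)
```

The second row has an empty stage and no type link, so both fields come back as `None` while the name, URL, and date still belong to the right company.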

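The point of the `write_in_json` step is to keep Chinese text readable in the saved file. A minimal sketch of the difference, using only the standard library and assuming Python 3: by default `json.dumps` escapes every non-ASCII character, while `ensure_ascii=False` leaves the characters as they are. The sample record and the temporary file path are assumptions for illustration.

```python
import json
import os
import tempfile

# Sample record for illustration only.
data = [{"companyName": "成都思科", "stage": "天使輪"}]

# Default: every non-ASCII character becomes a backslash-u escape sequence.
escaped = json.dumps(data, indent=2)
# ensure_ascii=False: the characters are written as-is.
readable = json.dumps(data, indent=2, ensure_ascii=False)

print(escaped)  # ASCII-only output, safe on any console

# Write the readable form as UTF-8, then read it back to confirm the round trip.
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as fw:
    fw.write(readable)
with open(path, encoding="utf-8") as fr:
    loaded = json.load(fr)
```

Both strings decode to the same data; only the on-disk representation differs, so the choice is about how easy the file is to read by eye.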