A small scraping example with selenium + GraphQuery in Python

Original

待鸣

2019-06-28 03:02

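Anti-scraping measures keep evolving, and scraping techniques escalate right along with them. After running into IP bans, JavaScript-based anti-bot checks and similar defenses, I settled on a scraping combination that works reasonably well.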

GraphQuery: https://github.com/storyicon/graphquery

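There does not seem to be much material about it in China, but it is quite capable: you fetch the structured data you need in a way that resembles calling an API.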
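
To make the "API-style request" concrete, here is a minimal sketch of the request payload and the reply handling. It assumes a GraphQuery HTTP service that accepts `document` and `expression` form fields and answers with JSON whose results sit under a `data` key, which is how the script in this post uses it. The helper names `build_payload` and `extract_field`, and the sample reply, are made up for illustration and are not part of GraphQuery.

```python
import json

# Expression used in this post: select every <a> inside a <table>
# and collect each href attribute into a list named "url".
LINK_EXPR = r"""
{
    url `css("table a")` [u `attr("href")`]
}
"""


def build_payload(document, expr):
    """Return the form fields that get POSTed to the GraphQuery service."""
    return {"document": document, "expression": expr}


def extract_field(response_text, field):
    """Parse a JSON reply and return data[field], or [] if it is missing."""
    reply = json.loads(response_text)
    return (reply.get("data") or {}).get(field, [])


# Hypothetical reply, shaped like the one the script in this post reads.
sample_reply = '{"data": {"url": ["detail.aspx?id=1", "detail.aspx?id=2"]}}'
print(extract_field(sample_reply, "url"))
# prints ['detail.aspx?id=1', 'detail.aspx?id=2']
```

With a running service, the payload would be sent with something like `requests.post("http://127.0.0.1:8559", data=build_payload(html, LINK_EXPR))` and the reply text passed to `extract_field`.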

selenium:

A tool for controlling Chrome and automating the browser.

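The basic idea:

selenium loads the page and retrieves its HTML; that HTML, together with an expression describing the fields you want, is handed straight to GraphQuery, which returns the desired data.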
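
The two-step flow can be sketched as a small function that takes the "render" and "extract" steps as parameters. This is only an illustration under stated assumptions: `collect_links`, `fake_render` and `fake_query` are hypothetical names, and the stand-ins replace the browser and the GraphQuery service so the sketch runs offline (the regex in `fake_query` is a placeholder, not how GraphQuery works).

```python
import re
from typing import Callable, List


def collect_links(
    url: str,
    render: Callable[[str], str],
    query: Callable[[str, str], List[str]],
    expr: str,
) -> List[str]:
    """Render a page, then hand its HTML plus an expression to the extractor."""
    html = render(url)        # step 1: the browser produces the final HTML
    return query(html, expr)  # step 2: the extractor returns structured data


# Stand-in for selenium: returns fixed HTML instead of driving a browser.
def fake_render(url: str) -> str:
    return '<table><tr><td><a href="item1.aspx">one</a></td></tr></table>'


# Stand-in for GraphQuery: a real version would POST html and expr to the service.
def fake_query(html: str, expr: str) -> List[str]:
    return re.findall(r'href="([^"]+)"', html)


links = collect_links("http://example.invalid/list", fake_render, fake_query, "...")
print(links)
# prints ['item1.aspx']
```

In a real run, `render` would wrap `driver.get(url)` followed by reading `driver.page_source`, and `query` would wrap the POST to GraphQuery. Keeping the two steps separate makes it easy to swap either one out.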

Here is the code. It is fairly simple, just a throwaway tool:

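import json

import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# First collect all the detail-page links
# url = "http://yjj.sh.gov.cn/XingZhengChuFa/xxgk2.aspx?pu=&qymc=&slrqstart=&slrqend=&pageindex=1&pagesize=100"

driver_path = r"G:\chromedriver_win32\chromedriver.exe"
opt = webdriver.ChromeOptions()

# Run Chrome without a visible window
opt.add_argument('--headless')
opt.add_argument('--disable-gpu')

# Selenium 4 style; on Selenium 3, pass executable_path=driver_path to Chrome() instead
driver = webdriver.Chrome(service=Service(executable_path=driver_path), options=opt)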



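res = []  # detail-page URLs that matched a keyword


def GraphQuery(document, expr):
    """POST the HTML document and an expression to the local GraphQuery service; return the raw JSON text."""
    response = requests.post("http://127.0.0.1:8559", data={
        "document": document,
        "expression": expr,
    })
    return response.text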

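def go(url):
    # Let the browser render the listing page, then grab the resulting HTML
    driver.get(url)
    content = driver.page_source
    # Extract the href of every link inside a table
    conseq = GraphQuery(content, r"""
                    {
                        url `css("table a")` [u `attr("href")`]
                    }
                """)
    count = json.loads(conseq)
    # print(count)
    # Visit each detail link in turn
    for href in count['data']['url']:
        u = 'http://yjj.sh.gov.cn/XingZhengChuFa/' + href
        # Check whether the page mentions salad (either spelling: 沙拉 or 色拉)
        page = requests.get(u)
        content = page.content.decode("UTF-8")
        keys = ["沙拉", "色拉"]
        for key in keys:
            if key in content:
                # Found: record the link
                res.append(u)
                with open('test.txt', 'a+') as f1:
                    f1.write(u + '\n')
                break  # record each link once even if both keywords appear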


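# Walk listing pages 100 through 619
for page_index in range(100, 620):
    url = "http://yjj.sh.gov.cn/XingZhengChuFa/xxgk2.aspx?pu=&qymc=&slrqstart=&slrqend=&pageindex=" + str(page_index) + "&pagesize=50"
    try:
        print(page_index)
        go(url)
        print(res)
    except Exception as exc:
        # Skip pages that fail, but report why instead of failing silently
        print("failed:", url, exc)

driver.quit()
print(res)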
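
The script builds its URLs by string concatenation and checks keywords in a nested loop. As an optional alternative, the same steps could be written with the standard library's `urllib.parse`, roughly as sketched below. The helper names (`list_page_url`, `detail_url`, `mentions_keyword`) are hypothetical, and the sample hrefs in the usage lines are made up; the base URL and query parameters are the ones used in the script.

```python
from urllib.parse import urlencode, urljoin

BASE = "http://yjj.sh.gov.cn/XingZhengChuFa/"
KEYWORDS = ["沙拉", "色拉"]  # two spellings of "salad"


def list_page_url(page, page_size=50):
    """Build the listing URL for one page, keeping the empty filter fields."""
    params = {
        "pu": "", "qymc": "", "slrqstart": "", "slrqend": "",
        "pageindex": page, "pagesize": page_size,
    }
    return urljoin(BASE, "xxgk2.aspx") + "?" + urlencode(params)


def detail_url(href):
    """Resolve a link found on the listing page against the base URL."""
    return urljoin(BASE, href)


def mentions_keyword(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in the page text."""
    return any(key in text for key in keywords)


print(list_page_url(100))
print(detail_url("detail.aspx?id=1"))  # made-up href for illustration
print(mentions_keyword("水果沙拉"))
```

`urljoin` also copes with hrefs that start with `/` or are already absolute, which plain concatenation would get wrong.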
