Scrapy Patent Crawler (Part 3): A Simple Hands-On Example

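Identifying the URLs

Open the Network tab in Chrome's developer tools (Inspect) and watch the requests that are sent when you run a patent search. For every search, the browser first sends two requests to the server.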

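Sending the requests

Observation shows that the site's search flow is:

First, send a parameterless POST request to preExecuteSearch!preExcuteSearch.do, which passes the client's IP address to the server.
Then send biaogejsAC!executeCommandSearchUnLogin.do, which carries the search parameters to the server.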
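The two-step flow above can be sketched as plain data, without Scrapy, to make the ordering explicit. This is only an illustration: the helper name build_search_flow is hypothetical, and the base path for the first endpoint is an assumption (it is taken to be the same directory as the search URL used later in this post), so check it against the Network tab before relying on it.

```python
from urllib.parse import urlencode, urljoin

# Assumption: both endpoints share the base path of the search URL shown later in the post.
BASE_URL = "http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/"


def build_search_flow(form_data, base_url=BASE_URL):
    """Return the two POST requests, in sending order, as (url, body) pairs."""
    # Step 1: parameterless POST, so the body is empty.
    pre_search = (urljoin(base_url, "preExecuteSearch!preExcuteSearch.do"), "")
    # Step 2: the actual search, with the form fields URL-encoded into the body.
    search = (
        urljoin(base_url, "biaogejsAC!executeCommandSearchUnLogin.do"),
        urlencode(form_data),
    )
    return [pre_search, search]


flow = build_search_flow({"searchCondition.dbId": "VDB"})
for url, body in flow:
    print(url, repr(body))
```

In a Scrapy spider, the same ordering would typically be achieved by yielding the second request from the callback of the first.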

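Filling in the form and sending the request

Only a simple example is given here; for the full implementation, see GitHub or the attached code.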

from scrapy import FormRequest

# Fragment from inside a spider method; SearchService and BaseConfig are project modules.
headers = {
    "Content-Type": "application/x-www-form-urlencoded"
}
# Build the search expression from the start date, applicant, inventor and patent type.
searchExp = SearchService.getCnSearchExp(self.startDate, proposer, inventor, type)
formData = {
    "searchCondition.searchExp": searchExp,
    "searchCondition.dbId": "VDB",
    "searchCondition.searchType": "Sino_foreign",
    "searchCondition.power": "false",
    "wee.bizlog.modulelevel": "0200201",
    # Form values must be strings, so convert the page size explicitly.
    "resultPagination.limit": str(BaseConfig.CRAWLER_SPEED)
}
yield FormRequest(
    url="http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/biaogejsAC!executeCommandSearchUnLogin.do",
    callback=self.parsePatentList,
    method="POST",
    headers=headers,
    formdata=formData,
    # meta carries the query context through to the parsePatentList callback.
    meta={
        'searchExp': searchExp,
        'inventionType': type,
        'startDate': self.startDate,
        'proposer': proposer,
        'inventor': inventor
    }
)
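One detail worth checking when filling in the form is that every value is a string before it is URL-encoded. The sketch below shows this with the standard library only; to_form_fields is a hypothetical helper, and the search expression and page size are placeholder values, not ones taken from the project.

```python
from urllib.parse import urlencode


def to_form_fields(params):
    """Convert every value to str so the dict can be used as form data."""
    return {key: str(value) for key, value in params.items()}


# Placeholder values for illustration only.
form_data = to_form_fields({
    "searchCondition.searchExp": "placeholder expression",
    "searchCondition.dbId": "VDB",
    "resultPagination.limit": 10,  # an int, as a config constant might be
})

# The body that an application/x-www-form-urlencoded POST would carry.
body = urlencode(form_data)
print(body)
```

Converting up front keeps the type handling in one place instead of scattering str() calls through the form dictionary.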

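Parsing the data

By inspecting the page in Chrome's Elements panel, we can locate the elements we need one by one.

This project parses pages with BeautifulSoup. For an element that has a class, find(attrs={"class": "className"}) is enough to select it, and other attributes work the same way. Here is a simple example: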

from bs4 import BeautifulSoup

# Re-parse a single result node (item) so it can be searched on its own.
itemSoup = BeautifulSoup(item.prettify(), "lxml")
header = itemSoup.find(attrs={"class": "item-header"})
pi['name'] = header.find("h1").get_text(strip=True)
pi['type'] = header.find(attrs={"class": "btn-group left clear"}).get_text(strip=True)
pi['patentType'] = QueryInfo.inventionTypeToString(type)
# The content body holds the remaining fields of the result.
content = itemSoup.find(attrs={"class": "item-content-body left"})
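To try the class-based lookup in isolation, the following sketch runs it against a small hand-written HTML snippet. The markup is invented for illustration and only mimics the class names used above; the real page structure may differ. It uses Python's built-in "html.parser" backend instead of lxml so that no extra parser is required, and parse_item is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented sample markup that reuses the class names from the example above.
SAMPLE_HTML = """
<div class="item">
  <div class="item-header">
    <h1> Example patent title </h1>
    <div class="btn-group left clear"> Invention </div>
  </div>
  <div class="item-content-body left"><p>Placeholder body</p></div>
</div>
"""


def parse_item(html):
    """Extract the title and type label from one result block."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find(attrs={"class": "item-header"})
    return {
        "name": header.find("h1").get_text(strip=True),
        # A multi-word class string matches the exact class attribute value.
        "type": header.find(attrs={"class": "btn-group left clear"}).get_text(strip=True),
    }


result = parse_item(SAMPLE_HTML)
print(result)
```

Note that matching on the full string "btn-group left clear" depends on the classes appearing in exactly that order; matching on a single class such as "btn-group" is less brittle if the page changes.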

Collecting the data

As before, yield the item so that it is passed on to the pipeline for processing. Data processing is covered in more detail in the next part.
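As a rough idea of what receives the yielded item, here is a minimal pipeline sketch. It is not the project's actual pipeline: CollectPatentsPipeline is a hypothetical class that only keeps items in memory, and it is exercised by hand here rather than by Scrapy.

```python
class CollectPatentsPipeline:
    """Hypothetical pipeline that stores each parsed item in a list."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.items = []

    def process_item(self, item, spider):
        # Called for every item the spider yields; returning the item
        # lets any later pipeline stage see it as well.
        self.items.append(dict(item))
        return item


# Manual run that mimics the calls Scrapy would make (spider is unused here).
pipeline = CollectPatentsPipeline()
pipeline.open_spider(spider=None)
returned = pipeline.process_item({"name": "Example patent title"}, spider=None)
print(pipeline.items)
```

In a real project the class would also have to be registered under ITEM_PIPELINES in the Scrapy settings before it is called.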

Source code download

CSDN
GitHub
