Scraper: downloading patent full-text PDFs via the hyperlinks behind the patent numbers in a spreadsheet

A colleague at work wrote this for me. I had never touched web scraping before, so I'm archiving the code here to learn from it.

This first script reads the patent numbers and their corresponding hyperlinks from the spreadsheet cells and saves them to a new CSV file.



import pandas as pd
from openpyxl import load_workbook

# Source spreadsheet; the hyperlinks sit on the "公開號" (publication number) cells
filename = '/data/datasets/LLMS/20231208 湖南中煙專利檢索結果及全文(去噪前)/加熱捲菸煙支結構專利風險評估項目/煙支結構專利檢索結果(電腦連接外網狀態下點擊“公開號”可打開專利全文鏈接).XLSX'

# pandas.read_excel drops hyperlink metadata, so read the cells with openpyxl
workbook = load_workbook(filename)
sheet = workbook[workbook.sheetnames[0]]  # the data is on the first sheet

ids = []
links = []
for row in sheet.iter_rows():
    for cell in row:
        if cell.hyperlink:
            ids.append(row[1].value)            # column B holds the patent number
            links.append(cell.hyperlink.target) # the underlying URL

# Assemble the collected pairs into a DataFrame
data = {'ID': ids, 'Link': links}
df = pd.DataFrame(data)

# Save as a CSV file
df.to_csv('output.csv', index=False)
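
As a quick sanity check (a minimal sketch of my own, not part of the original script), you can read output.csv back and confirm that every patent number actually got a link:

import pandas as pd

out = pd.read_csv('output.csv')
print(out.head())  # first few ID/Link pairs
print(len(out), 'rows;', out['Link'].isna().sum(), 'rows missing a link')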

The following is the scraper itself. According to my colleague, disguising the client as a real browser is a very effective way to get past anti-scraping measures; beyond that, the script just sleeps briefly to give each page time to respond.

import pandas as pd
import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import re
import time
import os


# CSV of patent numbers and full-text links (produced by the script above)
data = pd.read_csv('/data/python projects/image_tool/notebooks_dev/llms/專利檢索/links.csv')
# Directory where the downloaded PDFs will be stored
path = '/data/datasets/LLMS/20231208 湖南中煙專利檢索結果及全文(去噪前)/加熱捲菸煙支結構專利風險評估項目/files'

# Attach to a real, already-running Chrome instance instead of launching a fresh
# automated one, so the site is less likely to flag the session as a bot
chrome_options = Options()
# First start Chrome with remote debugging enabled by running this in a terminal
# (the port must match the debuggerAddress below):
# chrome --remote-debugging-port=9222 --user-data-dir="/data0/home/aimall/others/google"
chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
# The chromedriver version must match the installed Chrome version; download it from
# https://sites.google.com/chromium.org/driver/downloads/version-selection
service = Service("chromedriver_linux64/chromedriver-linux64/chromedriver")
driver = webdriver.Chrome(service=service, options=chrome_options)

# Matches any http(s) URL embedded in the rendered page source
pattern = r'https?://[^\s"]+'

if __name__ == '__main__':
    for _, row in data.iterrows():
        pdf_name = str(row.iloc[0]) + '.pdf'      # patent number -> file name
        url = row.iloc[1].replace('abst', 'pdf')  # abstract page -> full-text PDF page
        driver.get(url)
        time.sleep(3)  # wait for the page to finish rendering
        page_source = driver.page_source
        urls = re.findall(pattern, page_source)
        pdf_url = None
        for candidate in urls:
            # the actual PDF is served from the patsnap CDN
            if candidate.startswith('https://patsnap-pdf.cdn.zhihuiya.com'):
                pdf_url = candidate
        if pdf_url is not None:
            # URLs scraped from page source are HTML-escaped; undo that
            pdf_url = pdf_url.replace('&amp;', '&')
            response = requests.get(pdf_url)
            with open(os.path.join(path, pdf_name), "wb") as file:
                file.write(response.content)
        time.sleep(2)  # throttle requests between patents
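
The requests.get call above sends no browser headers at all; if the CDN ever starts checking them, the browser-disguise trick my colleague mentioned also applies to plain HTTP downloads. Here is a minimal sketch under that assumption: the User-Agent string and the %PDF magic-byte check are my own additions, not part of the original script.

import requests

# An illustrative desktop-Chrome User-Agent string; swap in whatever your real
# browser reports (this exact value is an assumption, not from the original)
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

def download_pdf(pdf_url, dest_path):
    response = requests.get(pdf_url, headers=HEADERS, timeout=30)
    response.raise_for_status()                   # fail loudly on HTTP errors
    if not response.content.startswith(b'%PDF'):  # PDF files begin with %PDF
        raise ValueError('Not a PDF: ' + pdf_url)
    with open(dest_path, 'wb') as f:
        f.write(response.content)

Checking the first bytes catches the common failure mode where the server returns an error page with status 200, which would otherwise be saved as a corrupt .pdf file.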
