Completely Defeating Sogou's Anti-Scraping for WeChat Official Account Articles

It's simple: selenium + chromedriver. For the Sogou side, just drive the automated Chrome browser directly; mp.weixin.qq.com, on the other hand, belongs to Tencent and has no anti-scraping measures, so plain urllib or requests is enough.
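For the Tencent side that really is all it takes; a minimal sketch (the article URL below is a placeholder, not a real link):

import requests

# Placeholder URL: substitute any article link collected from the Sogou results.
article_url = "https://mp.weixin.qq.com/s?__biz=XXXX&mid=XXXX"
html = requests.get(article_url, headers={"User-Agent": "Mozilla/5.0"}).text
print(html[:200])  # the article HTML comes back directly, no anti-bot challenge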

You need to scan the QR code to log in; without logging in you can only collect 10 pages of results.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup
import threading

driver = webdriver.Chrome()
driver.get("http://weixin.sogou.com/")

# Click the login button, then scan the QR code by hand;
# without logging in, Sogou only serves 10 pages of results.
driver.find_element(By.XPATH, '//*[@id="loginBtn"]').click()
input("Scan the QR code in the browser, then press Enter to continue...")

find = input("Enter the keyword you want to search for: ")
driver.find_element(By.XPATH, '//*[@id="query"]').send_keys(find)
driver.find_element(By.XPATH, '//*[@id="searchForm"]/div/input[3]').click()
time.sleep(2)

url_list = []
while True:
    page_source = driver.page_source
    bs_obj = BeautifulSoup(page_source, "html.parser")
    # Each search result sits in a div.txt-box; the article link is in its h3 > a.
    one_url_list = bs_obj.find_all("div", {"class": "txt-box"})
    for url in one_url_list:
        url_list.append(url.h3.a.attrs['href'])
    # Stop when the "next page" link disappears on the last page of results.
    next_tag = bs_obj.find("a", {"id": "sogou_next"})
    if next_tag is None:
        break
    next_page = "http://weixin.sogou.com/weixin" + next_tag.attrs['href']
    driver.get(next_page)
    time.sleep(1)
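The download code below calls two helpers that this post does not define, get_total_url and check_mkdir. A minimal sketch of what they presumably do (the names come from the code; the bodies here are my assumption):

import os

def get_total_url(url):
    # Assumption: complete protocol-relative image URLs such as "//mmbiz.qpic.cn/...".
    if url.startswith("//"):
        return "http:" + url
    return url

def check_mkdir(path):
    # Assumption: create the target directory if it does not exist yet.
    if not os.path.isdir(path):
        os.makedirs(path)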

import requests
import urllib.request

# mp.weixin.qq.com serves article pages to any client; a plain User-Agent is enough.
header = {"User-Agent": "Mozilla/5.0"}

def get_img(url, num, connect, cursor):
    response = requests.get(url, headers=header).content
    content = str(response, encoding="utf-8")
    bs_obj = BeautifulSoup(content, "html.parser")
    img_list = bs_obj.find_all("img")
    count = 0
    for img in img_list:
        try:
            # WeChat lazy-loads images: the real URL lives in data-src, not src.
            imgurl = get_total_url(img.attrs["data-src"])
            store_name = "%s%s" % (num, count)
            path = r"C:\Users\Mr.Guo\Pictures\weixin"
            check_mkdir(path)
            urllib.request.urlretrieve(imgurl, r"C:\Users\Mr.Guo\Pictures\weixin\%s.jpeg" % store_name)
            insert_into_table(connect, cursor, store_name, content)
            count += 1
        except Exception:
            # Skip images without data-src or that fail to download.
            continue

# One thread per article URL collected above.
for url_num in range(len(url_list)):
    t = threading.Thread(target=get_img, args=(url_list[url_num], url_num, connect, cursor))
    t.start()
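connect, cursor, and insert_into_table are likewise assumed to exist before the threading loop runs. One way to wire them up, sketched here with sqlite3 (the original project may well have used a different database; the table layout is my guess):

import sqlite3

# check_same_thread=False lets the worker threads share this connection;
# a production version would serialize writes with a lock.
connect = sqlite3.connect("weixin.db", check_same_thread=False)
cursor = connect.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS articles (store_name TEXT, html TEXT)")

def insert_into_table(connect, cursor, store_name, html):
    # Assumption: persist the image's store name alongside the article HTML it came from.
    cursor.execute("INSERT INTO articles VALUES (?, ?)", (store_name, html))
    connect.commit()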
    

