Scraping Baidu Index + logging in with pyppeteer (solving the rotating CAPTCHA)

 

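The points on the Baidu Index line charts are encrypted using two strings.

The data endpoint returns a data value, which serves as e, and a uniqid, which is used to request the value t.

Once you have both, they are passed to a processing function, decrypt.

Feeding t and e into decrypt confirms that the output is exactly what we want. The Python version is as follows: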

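def decrypt_py(t, e):
    """
    :param t: key string; its first half lists the cipher characters, its second half the matching plaintext characters
    :param e: encrypted data string
    :return: list of decoded values
    """
    a = dict()
    length = int(len(t) / 2)
    for o in range(length):
        a[t[o]] = t[length + o]
    r = "".join([a[each] for each in e]).split(",")

    return r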
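To see the substitution at work, here is a minimal sketch with a made-up key and ciphertext. The real t and e come from Baidu's endpoints; `toy_t`, `toy_e` and the name `decrypt_sketch` are invented purely for illustration.

```python
def decrypt_sketch(t: str, e: str) -> list:
    """Decode e with the substitution table packed into t."""
    half = len(t) // 2
    # Pair each cipher character (first half of t) with its
    # plaintext character (second half of t).
    table = dict(zip(t[:half], t[half:]))
    return "".join(table[ch] for ch in e).split(",")


# Invented key: q->0, w->1, e->2, r->3, ..., p->9, a->","
toy_t = "qwertyuiopa" + "0123456789,"
toy_e = "weartyaui"
print(decrypt_sketch(toy_t, toy_e))  # ['12', '345', '67']
```

Each character of e is looked up in the table, and the decoded string is then split on commas to give one value per day.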

Province and city names are stored in dictionaries and looked up by ID:

#baidu_id.py
city={1:"濟南",2:"貴陽",3:"黔南",4:"六盤水",5:"南昌",6:"九江",7:"鷹潭",8:"撫州",9:"上饒",10:"贛州",11:"重慶",13:"包頭",14:"鄂爾多斯",15:"巴彥淖爾",16:"烏海",17:"阿拉善盟",19:"錫林郭勒盟",20:"呼和浩特",21:"赤峯",22:"通遼",25:"呼倫貝爾",28:"武漢",29:"大連",30:"黃石",31:"荊州",32:"襄陽",33:"黃岡",34:"荊門",35:"宜昌",36:"十堰",37:"隨州",38:"恩施",39:"鄂州",40:"咸寧",41:"孝感",42:"仙桃",43:"長沙",44:"岳陽",45:"衡陽",46:"株洲",47:"湘潭",48:"益陽",49:"郴州",50:"福州",51:"莆田",52:"三明",53:"龍巖",54:"廈門",55:"泉州",56:"漳州",57:"上海",59:"遵義",61:"黔東南",65:"湘西",66:"婁底",67:"懷化",68:"常德",73:"天門",74:"潛江",76:"濱州",77:"青島",78:"煙臺",79:"臨沂",80:"濰坊",81:"淄博",82:"東營",83:"聊城",84:"菏澤",85:"棗莊",86:"德州",87:"寧德",88:"威海",89:"柳州",90:"南寧",91:"桂林",92:"賀州",93:"貴港",94:"深圳",95:"廣州",96:"宜賓",97:"成都",98:"綿陽",99:"廣元",100:"遂寧",101:"巴中",102:"內江",103:"瀘州",104:"南充",106:"德陽",107:"樂山",108:"廣安",109:"資陽",111:"自貢",112:"攀枝花",113:"達州",114:"雅安",115:"吉安",117:"昆明",118:"玉林",119:"河池",123:"玉溪",124:"楚雄",125:"南京",126:"蘇州",127:"無錫",128:"北海",129:"欽州",130:"防城港",131:"百色",132:"梧州",133:"東莞",134:"麗水",135:"金華",136:"萍鄉",137:"景德鎮",138:"杭州",139:"西寧",140:"銀川",141:"石家莊",143:"衡水",144:"張家口",145:"承德",146:"秦皇島",147:"廊坊",148:"滄州",149:"溫州",150:"瀋陽",151:"盤錦",152:"哈爾濱",153:"大慶",154:"長春",155:"四平",156:"連雲港",157:"淮安",158:"揚州",159:"泰州",160:"鹽城",161:"徐州",162:"常州",163:"南通",164:"天津",165:"西安",166:"蘭州",168:"鄭州",169:"鎮江",172:"宿遷",173:"銅陵",174:"黃山",175:"池州",176:"宣城",177:"巢湖",178:"淮南",179:"宿州",181:"六安",182:"滁州",183:"淮北",184:"阜陽",185:"馬鞍山",186:"安慶",187:"蚌埠",188:"蕪湖",189:"合肥",191:"遼源",194:"松原",195:"雲浮",196:"佛山",197:"湛江",198:"江門",199:"惠州",200:"珠海",201:"韶關",202:"陽江",203:"茂名",204:"潮州",205:"揭陽",207:"中山",208:"清遠",209:"肇慶",210:"河源",211:"梅州",212:"汕頭",213:"汕尾",215:"鞍山",216:"朝陽",217:"錦州",218:"鐵嶺",219:"丹東",220:"本溪",221:"營口",222:"撫順",223:"阜新",224:"遼陽",225:"葫蘆島",226:"張家界",227:"大同",228:"長治",229:"忻州",230:"晉中",231:"太原",232:"臨汾",233:"運城",234:"晉城",235:"朔州",236:"陽泉",237:"呂梁",239:"海口",241:"萬寧",242:"瓊海",243:"三亞",244:"儋州",246:"新餘",253:"南平",256:"宜春",259:"保定",261:"唐山",262:"南陽",263:"新鄉",264:"開封",265:"焦作",266:"平頂山",268:"許昌",269:"永州",270:"吉林",271:"銅川",272:"安康",273:"寶雞",274:"商洛",275:"渭南",276:"漢中",
277:"咸陽",278:"榆林",280:"石河子",281:"慶陽",282:"定西",283:"武威",284:"酒泉",285:"張掖",286:"嘉峪關",287:"台州",288:"衢州",289:"寧波",291:"眉山",292:"邯鄲",293:"邢臺",295:"伊春",297:"大興安嶺",300:"黑河",301:"鶴崗",302:"七臺河",303:"紹興",304:"嘉興",305:"湖州",306:"舟山",307:"平涼",308:"天水",309:"白銀",310:"吐魯番",311:"昌吉",312:"哈密",315:"阿克蘇",317:"克拉瑪依",318:"博爾塔拉",319:"齊齊哈爾",320:"佳木斯",322:"牡丹江",323:"雞西",324:"綏化",331:"烏蘭察布",333:"興安盟",334:"大理",335:"昭通",337:"紅河",339:"曲靖",342:"麗江",343:"金昌",344:"隴南",346:"臨夏",350:"臨滄",352:"濟寧",353:"泰安",356:"萊蕪",359:"雙鴨山",366:"日照",370:"安陽",371:"駐馬店",373:"信陽",374:"鶴壁",375:"周口",376:"商丘",378:"洛陽",379:"漯河",380:"濮陽",381:"三門峽",383:"阿勒泰",384:"喀什",386:"和田",391:"亳州",395:"吳忠",396:"固原",401:"延安",405:"邵陽",407:"通化",408:"白山",410:"白城",417:"甘孜",422:"銅仁",424:"安順",426:"畢節",437:"文山",438:"保山",456:"東方",457:"阿壩",466:"拉薩",467:"烏魯木齊",472:"石嘴山",479:"涼山",480:"中衛",499:"巴音郭楞",506:"來賓",514:"北京",516:"日喀則",520:"伊犁",525:"延邊",563:"塔城",582:"五指山",588:"黔西南",608:"海西",652:"海東",653:"克孜勒蘇柯爾克孜",654:"天門仙桃",655:"那曲",656:"林芝",657:"None",658:"防城",659:"玉樹",660:"伊犁哈薩克",661:"五家渠",662:"思茅",663:"香港",664:"澳門",665:"崇左",666:"普洱",667:"濟源",668:"西雙版納",669:"德宏",670:"文昌",671:"怒江",672:"迪慶",673:"甘南",674:"陵水黎族自治縣",675:"澄邁縣",676:"海南",677:"山南",678:"昌都",679:"樂東黎族自治縣",680:"臨高縣",681:"定安縣",682:"海北",683:"昌江黎族自治縣",684:"屯昌縣",685:"黃南",686:"保亭黎族苗族自治縣",687:"神農架",688:"果洛",689:"白沙黎族自治縣",690:"瓊中黎族苗族自治縣",691:"阿里",692:"阿拉爾",693:"圖木舒克"}
province={901:"山東",902:"貴州",903:"江西",904:"重慶",905:"內蒙古",906:"湖北",907:"遼寧",908:"湖南",909:"福建",910:"上海",911:"北京",912:"廣西",913:"廣東",914:"四川",915:"雲南",916:"江蘇",917:"浙江",918:"青海",919:"寧夏",920:"河北",921:"黑龍江",922:"吉林",923:"天津",924:"陝西",925:"甘肅",926:"新疆",927:"河南",928:"安徽",929:"山西",930:"海南",931:"臺灣",932:"西藏",933:"香港",934:"澳門"}
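As a hedged sketch of how these tables get used: the region endpoint appears to return codes as string keys, so each code is cast to int before the lookup. The helper name `name_regions` and the sample payload below are assumptions for illustration. Using `.get` with a fallback avoids a KeyError when a code is missing from the table.

```python
# Tiny excerpts of the dictionaries above, just for the demo.
city = {514: "北京", 57: "上海", 95: "廣州"}
province = {911: "北京", 910: "上海", 913: "廣東"}


def name_regions(counts: dict, names: dict) -> list:
    """Attach a readable name to each {code: value} pair.

    Codes missing from the table become "unknown(<code>)" instead of
    raising KeyError.
    """
    return [
        {"name": names.get(int(code), f"unknown({code})"), "number": value}
        for code, value in counts.items()
    ]


# Hypothetical payload: string codes mapped to values.
sample = {"city": {"514": 1000, "57": 800, "9999": 5}, "prov": {"913": 700}}
print(name_regions(sample["city"], city))
print(name_regions(sample["prov"], province))
```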

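These province and city tables can be pulled from a JS file. Click the audience profile (人羣畫像) tab and search for a place name in the browser's Network panel. That turns up a JS file; open it and search again, and you will find a whole lot of cities.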

Now we can get to work. You need to log in first and copy the cookie from your browser.
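A copied cookie arrives as one long `name=value; name=value` string. The sketch below is an assumption about how you might handle it, not part of the original script: it shows the string used as a raw `Cookie` header and also split into a dict. The cookie names and values are placeholders, and `parse_cookie_header` is a hypothetical helper.

```python
def parse_cookie_header(raw: str) -> dict:
    """Split a 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for pair in raw.split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        # Split on the first "=" only, since values may contain "=".
        name, value = pair.strip().split("=", 1)
        cookies[name] = value
    return cookies


# Placeholder names and values, not real credentials.
raw_cookie = "BDUSS=xxxx; OTHER=a=b"
headers = {
    "Cookie": raw_cookie,  # the raw string can be sent as-is
    "User-Agent": "Mozilla/5.0",
}
print(parse_cookie_header(raw_cookie))  # {'BDUSS': 'xxxx', 'OTHER': 'a=b'}
```

With requests, either form works: pass `headers=headers`, or pass the parsed dict as `cookies=`.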

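Updated on 5.20 to match the API changes.

I also noticed that some readers got stuck as soon as the endpoints changed. Here is a URL encoding/decoding site: decode the request URL first and it becomes much easier to read.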

http://tool.chinaz.com/tools/urlencode.aspx
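The same decoding can be done locally with the standard library's `urllib.parse`. This is a minimal sketch; the encoded string is a made-up example in the style of the `wordlist%5B%5D=` parameter, and the word JSON follows the shape used by the search endpoint.

```python
from urllib.parse import quote, unquote

# Encoded form, as it might appear in the browser's Network panel.
encoded = "wordlist%5B%5D=%E7%99%BE%E5%BA%A6"
print(unquote(encoded))  # wordlist[]=百度

# Going the other way: percent-encode the JSON-style word parameter.
word_param = '[[{"name":"百度","wordType":1}]]'
encoded_word = quote(word_param)
print(encoded_word)

# Decoding the encoded value gives back the original string.
assert unquote(encoded_word) == word_param
```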

import requests
import datetime
from utils.baidu_id import province, city


def getIndex(word="我和我的祖國"):
    """
    Search index
    :param word: keyword
    :return: None (prints the peak news for the all, pc and wise series)
    """
    insert_word = """[[{"name":"%s","wordType":1}]]""" % word
    url = f"http://index.baidu.com/api/SearchApi/index?word={insert_word}&area=0&days=30"
    rep_json = get_rep_json(url)
    generalRatio = rep_json['data']['generalRatio']
    uniqid = rep_json['data']['uniqid']
    all_index_e = rep_json['data']['userIndexes'][0]['all']['data']
    pc_index_e = rep_json['data']['userIndexes'][0]['pc']['data']
    wise_index_e = rep_json['data']['userIndexes'][0]['wise']['data']
    t = getPtbk(uniqid)
    startDate = rep_json['data']['userIndexes'][0]['wise']['startDate']
    all_news = getTopNews(decrypt_py(t, all_index_e), startDate, word)
    pc_news = getTopNews(decrypt_py(t, pc_index_e), startDate, word)
    wise_news = getTopNews(decrypt_py(t, wise_index_e), startDate, word)
    for each in (all_news, pc_news, wise_news):
        print(each)
    return None


def getFeedIndex(word="我和我的祖國"):
    """
    :param word: keyword
    :return: feed index
    """
    insert_word="""[[{"name":"%s","wordType":1}]]"""%word
    url = "http://index.baidu.com/api/FeedSearchApi/getFeedIndex?word=%s&area=0&days=30" % insert_word
    feed_index_data = get_rep_json(url)
    uniqid = feed_index_data['data']['uniqid']
    data = feed_index_data["data"]['index'][0]
    generalRatio = data['generalRatio']  # feed index overview
    e = data['data']
    t = getPtbk(uniqid)

    return decrypt_py(t, e)


def getNewsDate(word="我和我的祖國"):
    """
    :param word: keyword
    :return: peak news for the media index
    """
    insert_word = """[[{"name":"%s","wordType":1}]]""" % word
    url = f"http://index.baidu.com/api/NewsApi/getNewsIndex?area=0&word={insert_word}&days=30"
    res_json = get_rep_json(url)['data']

    generalRatio = res_json["index"][0]['generalRatio']
    e = res_json['index'][0]['data']
    start_date = res_json['index'][0]['startDate']
    t = getPtbk(res_json['uniqid'])

    news = getTopNews(decrypt_py(t, e), start_date, word)

    return news


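def getTopNews(numList: list, start_date, word):
    """
    Find the peaks in the current index list,
    convert them to date strings,
    pass the joined date string to the news endpoint,
    and return the news data
    :param numList: list of index values
    :param start_date: start date
    :param word: keyword
    :return: news for the peak dates
    """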
    start_date = string_toDatetime(start_date)
    hill_tops = getHilltop(numList)
    hill_tops_date = [datetime_toString(start_date + datetime.timedelta(days=index)) for index in hill_tops]
    news = getNews(",".join(hill_tops_date), word)["data"][word]

    return news


def getNews(dts, word):
    """
    Fetch data from the media index endpoint
    :param dts: date strings joined with commas, e.g. dts=2019-10-06,2019-10-10,2019-10-12,2019-10-16,2019-10-21,2019-10-24
    :param word: keyword
    :return: data returned by the endpoint
    """
    url = f"http://index.baidu.com/api/NewsApi/checkNewsIndex?dates[]={dts}&type=day&words={word}"
    return get_rep_json(url)


def getHilltop(numList: list):
    """
    :param numList: list of numeric values
    :return: list of peak indices
    """
    numList = list(map(lambda x: float(x) if x else 0, numList))
    hillTops = [index for index, each in enumerate(numList) if
                index and index < len(numList) - 1 and each > numList[index - 1] and each > numList[index + 1]]

    return hillTops


def getMulti(word="我和我的祖國"):
    """Demand graph
    pv: search volume; ratio: rate of change in searches; sim: relevance
    """
    url = f"http://index.baidu.com/api/WordGraph/multi?wordlist%5B%5D={word}"
    word_data = get_rep_json(url)['data']['wordlist'][0]
    if word_data['keyword']:
        print(word_data['wordGraph'])


def getRegion(word="我和我的祖國", startDate='2019-09-17', endDate='2019-10-17'):
    """Regional distribution"""
    url = f"http://index.baidu.com/api/SearchApi/region?region=0&word={word}&startDate={startDate}&endDate={endDate}"
    region = get_rep_json(url)['data']['region'][0]
    region_city = [{'city': city[int(city_n)], 'number': region['city'][city_n]} for city_n in region['city']]
    region_prov = [{'prov': province[int(prov_n)], 'number': region['prov'][prov_n]} for prov_n in region['prov']]
    print(region_city, region_prov)


def getBaseAttributes(word="我和我的祖國"):
    """Audience attributes"""
    url = f"http://index.baidu.com/api/SocialApi/baseAttributes?wordlist[]={word}"
    rep_data = get_rep_json(url)['data']['result']
    return rep_data


def getInterest(word="我和我的祖國"):
    """Interest distribution"""
    url = f"http://index.baidu.com/api/SocialApi/interest?wordlist[]={word}"
    rep_data = get_rep_json(url)['data']['result']
    return rep_data


def string_toDatetime(string):
    # Convert a string to a datetime
    return datetime.datetime.strptime(string, "%Y-%m-%d")


def datetime_toString(dt):
    # Convert a datetime to a string
    return dt.strftime("%Y-%m-%d")


def getPtbk(uniqid):
    url = f"http://index.baidu.com/Interface/ptbk?uniqid={uniqid}"
    return get_rep_json(url)['data']


def decrypt_py(t, e):
    """
    :param t: key string returned by getPtbk
    :param e: encrypted data string
    :return: list of decoded values
    """
    a = dict()
    length = int(len(t) / 2)
    for o in range(length):
        a[t[o]] = t[length + o]
    r = "".join([a[each] for each in e]).split(",")
    print(r)

    return r


def get_rep_json(url):
    """
    Fetch and parse JSON
    :param url: endpoint to request
    :return: parsed JSON response
    """
    headers = {
        "Cookie": '',  # fill in the cookie copied from your browser
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    response_data = response.json()
    # print(response_data)
    return response_data


def main():
    getFeedIndex()
    getNewsDate()
    getIndex()
    getRegion()
    getBaseAttributes()
    getInterest()


if __name__ == "__main__":
    main()
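The least obvious step in the script is how peaks in the decoded series become the dates sent to the news endpoint. Here is a standalone sketch of that logic with made-up numbers; `peak_indices` and `peak_dates` are illustrative names, not functions from the script.

```python
import datetime


def peak_indices(values: list) -> list:
    """Indices strictly greater than both neighbours (endpoints excluded)."""
    # Empty strings stand for missing days and are treated as 0.
    nums = [float(v) if v else 0.0 for v in values]
    return [
        i for i in range(1, len(nums) - 1)
        if nums[i] > nums[i - 1] and nums[i] > nums[i + 1]
    ]


def peak_dates(values: list, start_date: str) -> list:
    """Convert peak indices to YYYY-MM-DD strings, counting from start_date."""
    start = datetime.datetime.strptime(start_date, "%Y-%m-%d")
    return [
        (start + datetime.timedelta(days=i)).strftime("%Y-%m-%d")
        for i in peak_indices(values)
    ]


# Made-up series; the empty string is a missing day.
series = ["10", "30", "20", "", "50", "40"]
print(peak_indices(series))              # [1, 4]
print(peak_dates(series, "2019-10-01"))  # ['2019-10-02', '2019-10-05']
```

Because the comparison is strict, a flat-topped peak (two equal maxima side by side) is not reported. That matches the behavior of getHilltop above.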

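There are topic endpoints as well. I looked them up and they work the same way, so have a go yourself if you are interested.

Topics:

Topic search index: http://insight.baidu.com/base/search/trend/general?id=23734&dateType=30&filterType=1&source=0

&filterType=1&source=1 (PC)

&source=2 (mobile)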
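The three variants differ only in the source parameter, so the URLs can be generated rather than typed out. In this sketch, `topic_trend_url` is a hypothetical helper. Reading `dateType` as a number of days and `source=0` as the combined series are assumptions based on the URLs above, not confirmed behavior.

```python
from urllib.parse import urlencode

BASE = "http://insight.baidu.com/base/search/trend/general"


def topic_trend_url(topic_id: int, source: int = 0, days: int = 30) -> str:
    """Build the topic search-index URL for one source value."""
    params = {
        "id": topic_id,
        "dateType": days,   # assumed to be the window length in days
        "filterType": 1,
        "source": source,   # 1 = PC, 2 = mobile; 0 assumed to be combined
    }
    return f"{BASE}?{urlencode(params)}"


for source in (0, 1, 2):
    print(topic_trend_url(23734, source))
```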

Topic feed and topic video

Topic feed: http://index.baidu.com/Interface/Newwordgraph/getTopicFeed?nodeid=23935
Matching ptbkTopic request: http://index.baidu.com/Interface/api/ptbkTopic?uniqid=5dad242a566a46.43359139

Topic video: /api/videoIndex/getVideoIndex?nodeid=23935
Matching ptbkTopic request: http://index.baidu.com/Interface/api/ptbkTopic?uniqid=5dad242a71d612.53363283

Brand attention

http://insight.baidu.com/base/search/topic/attentionBrand?id=23734

Search region distribution: http://insight.baidu.com/base/search/region/general?id=23734&dateType=30&filterType=1&pageSize=40

Audience attributes:

http://insight.baidu.com/base/search/Topic/baseAttributes?nodeid=23734

Interest distribution:

http://insight.baidu.com/base/search/Topic/interest?nodeid=23734&typeid=

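Simulated login: solving the rotating CAPTCHA

I am no longer who I used to be. These days I can actually beat it.

There is no mountain in this world that cannot be climbed. And if there is one, stand on the shoulders of giants and climb it again.

Here I come, here I come, and I have brought a model with me. Are you all still suffering over the rotating CAPTCHA? From now on you can go find something else to suffer over!!!

Come on, take a look at the result.

How about that, happy now? This post is already too long to leave room for all my bragging, so I wrote it up in a separate blog post:

Rotate-and-drag CAPTCHA solution

If I got anything wrong, I would still appreciate it if you pointed it out! Alright, being happy is all that matters!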
