Scraping Douyu with Scrapy

Overall workflow:

Scrape every streamer's name, room number, show (room) title, and room URL path from the site.

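First, open https://www.douyu.com/directory

Parse the variables embedded in the page's JavaScript to get the cate2Id value of each category, then append it to the API prefix: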

https://www.douyu.com/gapi/rkc/directory/2_+cate2Id
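
The extraction and concatenation step can be tried on its own, outside Scrapy. The following is a minimal sketch using only the standard library. The helper name build_category_urls and the sample string are hypothetical, and the pattern assumes each category object in the page starts with "cate2Name" and ends with "isDisplay", as in the spider code further down.

```python
import json
import re

# Assumption: each category object embedded in the page's JavaScript starts
# with "cate2Name" and ends with "isDisplay":0 or "isDisplay":1.
CATEGORY_PATTERN = re.compile(r'\{"cate2Name":.*?"isDisplay":[01]\}')
API_PREFIX = "https://www.douyu.com/gapi/rkc/directory/2_"


def build_category_urls(page_text):
    """Return one API URL per category object found in page_text."""
    urls = []
    for raw in CATEGORY_PATTERN.findall(page_text):
        cate2_id = json.loads(raw).get("cate2Id")
        if cate2_id is not None:
            urls.append(API_PREFIX + str(cate2_id))
    return urls


# Hypothetical sample input, not real page content.
sample = (
    'var $DATA = {"list":['
    '{"cate2Name":"LOL","cate2Id":1,"isDisplay":1},'
    '{"cate2Name":"DOTA2","cate2Id":3,"isDisplay":0}]};'
)
category_urls = build_category_urls(sample)
print(category_urls)
```

Keeping this step in a plain function makes it easy to check the pattern against a saved copy of the page before running the full crawl.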

Then crawl those URLs. Each response is JSON data, which is parsed to extract the fields.
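
As a rough illustration of the JSON parsing step, here is a minimal sketch. The original post does not show the response layout, so the field names used here (data, rl, nn, rid, rn, url) are assumptions about the payload, and parse_rooms is a hypothetical helper. Check them against a real response before relying on them.

```python
import json


def parse_rooms(body):
    """Extract the four target fields from one JSON page.

    Assumed layout (not confirmed by the original post):
    {"data": {"rl": [{"nn": ..., "rid": ..., "rn": ..., "url": ...}]}}
    """
    payload = json.loads(body)
    rooms = (payload.get("data") or {}).get("rl") or []
    items = []
    for room in rooms:
        items.append({
            "streamer_name": room.get("nn"),  # assumed key for the streamer name
            "room_id": room.get("rid"),       # assumed key for the room number
            "room_title": room.get("rn"),     # assumed key for the room title
            "room_url": "https://www.douyu.com" + str(room.get("url", "")),
        })
    return items


# Hypothetical sample payload, not real data.
sample_body = (
    '{"code":0,"data":{"rl":[{"rid":123,"rn":"Evening stream",'
    '"nn":"streamer_a","url":"/123"}]}}'
)
rooms = parse_rooms(sample_body)
print(rooms)
```

Using .get() throughout means a missing key yields None or an empty list instead of raising, which is convenient when some pages turn out to be empty.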

Part of the spider code is shown below:


import json
import re

from scrapy import Request, Spider


class douyuspider(Spider):
    name = "douyuspider"
    # Scrapy reads the plural attribute name allowed_domains.
    allowed_domains = ["douyu.com"]
    # Earlier test runs hard-coded the page URLs of a single category, e.g.
    # https://www.douyu.com/gapi/rkc/directory/2_1/0 through .../2_1/29 and
    # https://www.douyu.com/gapi/rkc/directory/2_270/1 through .../2_270/25.
    # Directory page; the URL of every category is derived from its response.
    start_urls = ["https://www.douyu.com/directory"]

    def parse(self, response):
        if response.status != 200:
            return
        # Extract the category objects from the JavaScript variables. The regex
        # runs on the raw HTML because the data sits inside a <script> block,
        # which BeautifulSoup's .text may leave out.
        jsons = re.findall(r'\{"cate2Name":.*?"isDisplay":[01]\}', response.text)
        cate1_urls = []
        # Build the URL of every category from its cate2Id.
        for json1 in jsons:
            self.logger.debug(json1)
            cate2_id = json.loads(json1).get("cate2Id")
            cate1_urls.append("https://www.douyu.com/gapi/rkc/directory/2_" + str(cate2_id))
        # Request every page of every category; the page count is unknown,
        # so default to 50.
        for url in cate1_urls:
            for i in range(0, 50):
                # parsePage is defined elsewhere in the spider (not shown here).
                yield Request(url + "/" + str(i), callback=self.parsePage, dont_filter=True)
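
Requesting a fixed 50 pages per category sends many requests that return nothing for small categories. If the JSON response reports a page count, the loop can be bounded by it. The sketch below assumes such a count exists under data.pgcnt, which the original post does not confirm. The helper page_urls and its parameters are hypothetical, and it falls back to the original default when the count is missing.

```python
import json


def page_urls(category_url, first_page_body, fallback_pages=50, first_page=0):
    """Build page URLs for one category.

    Assumption: the JSON body carries the page count at data.pgcnt.
    If it is missing or unreadable, fall back to a fixed number of pages,
    as the original loop does. first_page=0 mirrors range(0, 50) above.
    """
    try:
        data = json.loads(first_page_body).get("data") or {}
        page_count = int(data.get("pgcnt", fallback_pages))
    except (ValueError, TypeError, AttributeError):
        page_count = fallback_pages
    return [
        category_url + "/" + str(i)
        for i in range(first_page, first_page + page_count)
    ]


base = "https://www.douyu.com/gapi/rkc/directory/2_1"
bounded = page_urls(base, '{"data":{"pgcnt":3}}')
fallback = page_urls(base, "not json", fallback_pages=2)
print(bounded)
print(fallback)
```

In a spider, the first page of each category would be requested first and the remaining requests yielded from its callback, at the cost of one extra round trip per category.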

 
