Python crawler: simulating a GitHub login and running a search

Original

2019-07-30 13:56

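1. Goal

Using GitHub as the example, this post simulates the login process and then crawls pages that can only be accessed after logging in, such as the activity feed of people you follow and your personal information. These pages are visible while you are logged in and are no longer visible once you log out.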

2. Environment setup

Install the lxml and requests libraries.

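3. Analysing the login process

1 Log out first and clear the cookies.

2 Open https://github.com/login and capture the login traffic with the Google Chrome developer tools.

3 The capture taken after clicking the sign-in button is shown in the figure below:

The request headers include Cookie, Host, Origin, Referer, User-Agent and others. Send these headers when requesting the login page.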
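Because the same headers go out with every request, they can also be attached to the session once. The following is a minimal sketch, assuming requests is installed; `make_session` and `HEADERS` are hypothetical names, and `prepare_request` is used only to inspect the merged headers without touching the network.

```python
import requests

# Header values taken from the browser capture; any realistic User-Agent works.
HEADERS = {
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
    ),
}


def make_session(headers):
    """Return a Session that sends `headers` with every request (hypothetical helper)."""
    session = requests.Session()
    session.headers.update(headers)
    return session


session = make_session(HEADERS)
# prepare_request merges the session-level headers into the request; nothing is sent.
prepared = session.prepare_request(requests.Request("GET", "https://github.com/login"))
print(prepared.headers["User-Agent"])
```

With the headers stored on the session, passing `headers=` on each individual call becomes optional.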

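import re

import requests
from lxml import etree


class Login(object):
    def __init__(self):
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            # 'br' is left out: requests decodes Brotli only when a brotli package is installed
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Host': 'github.com',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
        }
        # Login page from step 2; the form-post URL is assumed to be /session (check it in your capture)
        self.login_url = 'https://github.com/login'
        self.post_url = 'https://github.com/session'
        # A single Session keeps the cookies across requests
        self.session = requests.Session()

    def get_token(self):
        # Request the GitHub login page
        response = self.session.get(self.login_url, headers=self.headers)
        # etree.HTML parses the text into a tree that supports XPath and
        # repairs malformed HTML (for example, it closes unclosed tags)
        selector = etree.HTML(response.text)
        # xpath() returns a list; the first match is the authenticity_token needed to log in
        token = selector.xpath("//input[@name='authenticity_token']/@value")[0]
        print(token)
        return token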
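The token lookup can be tried offline before pointing it at the real site. The sketch below is a swapped-in variant that uses the standard library's html.parser instead of lxml so that it runs without extra packages. `SAMPLE_HTML`, its form action and the token value are made up for illustration, and `TokenParser` and `extract_token` are hypothetical names.

```python
from html.parser import HTMLParser


class TokenParser(HTMLParser):
    """Remember the value of the first <input name="authenticity_token">."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_token_field = tag == "input" and attributes.get("name") == "authenticity_token"
        if is_token_field and self.token is None:
            self.token = attributes.get("value")


def extract_token(html):
    """Return the token string, or None when the field is absent."""
    parser = TokenParser()
    parser.feed(html)
    return parser.token


# Made-up login form, for illustration only.
SAMPLE_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123==">
  <input type="text" name="login">
</form>
"""

print(extract_token(SAMPLE_HTML))
```

Returning None for a missing field makes it easy to stop early if the login page layout has changed.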

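The login page supplies the value of the token. The authenticity_token field in the submitted form data is exactly the token obtained above. The next step is a POST request, and it must carry the form data, otherwise the request is rejected.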

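    def login(self):
        # Fetch the token from the login page before posting the form
        self.token = self.get_token()
        post_data = {
            'utf8': '✓',
            'authenticity_token': self.token,
            'login': 'your-username',      # replace with your account name
            'password': 'your-password',   # replace with your password
        }
        response = self.session.post(self.post_url, data=post_data, headers=self.headers)
        if response.status_code == 200:
            print(response)
        else:
            print(response.status_code)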
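Hard-coding the account and password in the form data is easy to leak. As one possible alternative, the form can be built from environment variables. This is only a sketch: `build_login_form` and the variable names `GITHUB_LOGIN` and `GITHUB_PASSWORD` are assumptions chosen for illustration, not anything GitHub or requests defines.

```python
import os


def build_login_form(token, env=os.environ):
    """Build the login form data from environment variables (hypothetical helper)."""
    login = env.get("GITHUB_LOGIN")        # assumed variable name
    password = env.get("GITHUB_PASSWORD")  # assumed variable name
    if not login or not password:
        raise RuntimeError("set GITHUB_LOGIN and GITHUB_PASSWORD first")
    return {
        "utf8": "✓",
        "authenticity_token": token,
        "login": login,
        "password": password,
    }


# Demonstration with a plain dict standing in for the real environment.
form = build_login_form(
    "abc123==",
    env={"GITHUB_LOGIN": "alice", "GITHUB_PASSWORD": "secret"},
)
print(sorted(form))
```

The returned dictionary can be passed as `data=` to the session's `post` call in place of the literal `post_data`.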

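You can return the HTML to inspect the result. Once the login succeeds, the session keeps the logged-in state, so further operations are possible, such as searching GitHub for projects or material, or downloading them. A search goes to https://github.com/search? followed by the query parameters; once these are set up in the same way as before, the page can be requested.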
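To see how the parameters end up in the address, the query string can be built by hand with the standard library. This is a small sketch; `build_search_url` is a hypothetical helper, and requests performs an equivalent encoding itself when it receives `params=`.

```python
from urllib.parse import urlencode


def build_search_url(keyword):
    """Return the search URL for `keyword` (hypothetical helper)."""
    params = {"utf8": "✓", "q": keyword, "type": ""}
    # urlencode percent-encodes the check mark as UTF-8 and turns spaces into '+'.
    return "https://github.com/search?" + urlencode(params)


print(build_search_url("python crawler"))
# prints https://github.com/search?utf8=%E2%9C%93&q=python+crawler&type=
```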

    def search(self):
        key_name = input('Search GitHub projects: ')
        params = {
            "utf8": "✓",
            "q": key_name,
            "type": ""
        }
        print(key_name)
        url = "https://github.com/search"
        # requests appends params to the URL as the query string
        response = self.session.get(url, headers=self.headers, params=params)
        print(response)

        return response.text

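    def get_search(self, html):
        # Requires `import re` at the top of the module.
        # Class of each result item:
        # class ="repo-list-item d-flex flex-column flex-md-row flex-justify-start py-4 public source"
        # Class of the description paragraph inside it:
        # class ="col-12 col-md-9 d-inline-block text-gray mb-2 pr-4"

        # re.S lets '.' match newlines, because a description can span several lines
        pattern = re.compile(r'<p class="col-12 col-md-9 d-inline-block text-gray mb-2 pr-4">(.*?)</p>', re.S)
        projects = pattern.findall(html)
        print(projects)

        for project in projects:
            print(project)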
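The raw matches usually still contain line breaks, indentation and nested tags. The sketch below applies the same pattern to a made-up HTML fragment and cleans each match; `extract_descriptions` and `SAMPLE` are hypothetical, and the class string only matches as long as the page keeps that markup.

```python
import re

DESCRIPTION_RE = re.compile(
    r'<p class="col-12 col-md-9 d-inline-block text-gray mb-2 pr-4">(.*?)</p>',
    re.S,
)
TAG_RE = re.compile(r"<[^>]+>")


def extract_descriptions(html):
    """Return the project descriptions as clean single-line strings."""
    cleaned = []
    for raw in DESCRIPTION_RE.findall(html):
        text = TAG_RE.sub("", raw)              # drop nested tags such as <em>
        cleaned.append(" ".join(text.split()))  # collapse newlines and repeated spaces
    return cleaned


# Made-up fragment imitating two search results.
SAMPLE = '''
<p class="col-12 col-md-9 d-inline-block text-gray mb-2 pr-4">
      A simple <em>crawler</em> framework
</p>
<p class="col-12 col-md-9 d-inline-block text-gray mb-2 pr-4">Scrape   GitHub search results</p>
'''

print(extract_descriptions(SAMPLE))
# prints ['A simple crawler framework', 'Scrape GitHub search results']
```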

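The final results can be analysed further later on: crawl all the matching projects and analyse them, for example to find out which projects have the most stars or the most comments.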
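As a rough idea of what such a follow-up analysis could look like, the sketch below ranks projects by star count. The code in this post does not extract star counts, so the record layout (`name`, `stars`), the sample values and the helper `top_by_stars` are all assumptions for illustration.

```python
def top_by_stars(projects, n=3):
    """Return the n projects with the most stars, highest first (hypothetical helper)."""
    return sorted(projects, key=lambda project: project["stars"], reverse=True)[:n]


# Made-up records; a real crawl would have to parse these fields from the result pages.
projects = [
    {"name": "alpha", "stars": 120},
    {"name": "beta", "stars": 4500},
    {"name": "gamma", "stars": 980},
]

for project in top_by_stars(projects, n=2):
    print(project["name"], project["stars"])
```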
