Crawler Tutorial Series, Part 3: requests in Detail

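Preface:

  1. Starting with this part, I will work through the libraries mentioned in Part 2 one by one, following their official documentation.
  2. A crawler starts by connecting to a web page, which we can do with a GET or a POST request. Either the urllib library [in Python 3 you import urllib; in Python 2, urllib and urllib2] or the requests library can do this. In terms of code complexity and readability, requests clearly serves programmers better, but I could not find a detailed Chinese-language explanation of it, which is why I wrote this article.
  3. All reference material comes from the official documentation: http://docs.python-requests.org/en/master/user/quickstart/#make-a-request
  4. The article includes some supplementary material; feel free to skim it if it does not interest you.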

How to use the requests library

  1. First, import the requests package:
import requests
  2. Then we can request a page with get or post (the two differ somewhat, so choose according to your needs):
req_1 = requests.get('https://m.weibo.cn/status/4278783500356969')
req_2 = requests.post('https://m.weibo.cn/status/4278783500356969')
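  • A quick aside: what do we get back from these two calls?
    • Now, we have a Response object called req_1/req_2. We can get all the information we need from this object.

    That is the explanation given in the official documentation. What we get is an object holding the content of the requested page (print req_1.text to take a look) along with related information, and we can access that information with the '.' operator. I will summarise it in detail at the end of the article [Note 1].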

  • A further aside: what other operations can we perform on a URL?
    req = requests.put('http://httpbin.org/put', data = {'key':'value'})
    req = requests.delete('http://httpbin.org/delete')
    req = requests.head('http://httpbin.org/get')
    req = requests.options('http://httpbin.org/get')
    
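  3. In most cases we also need to add parameters to the request. If you have used urllib, you will appreciate how convenient requests makes this:
  • First, how to add parameters, form data, or other information to a request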
    • get:
      payload = {'key1': 'value1', 'key2': 'value2'}  # a value may also be a list
      req = requests.get('http://httpbin.org/get', params=payload)
      
    • post:
      yourData = {'key':'value'}
      req = requests.post('http://httpbin.org/post', data=yourData)
      
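    • The next example shows that a single form field can carry several values, given either as a list of tuples or as a dict with a list value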
      payload_tuples = [('key1', 'value1'), ('key1', 'value2')]
      r1 = requests.post('http://httpbin.org/post', data=payload_tuples)
      payload_dict = {'key1': ['value1', 'value2']}
      r2 = requests.post('http://httpbin.org/post', data=payload_dict)
      print(r1.text)
      {
        ...
        "form": {
          "key1": [
            "value1",
            "value2"
          ]
        },
        ...
      }
      r1.text == r2.text
      True
      
    • This example shows that the body can be encoded in other ways too, for example as JSON
      # Option 1: serialise the payload yourself
      import json
      url = 'https://api.github.com/some/endpoint'
      payload = {'some': 'data'}
      req = requests.post(url, data=json.dumps(payload))
      
      # Option 2: let requests serialise it through the json parameter
      url = 'https://api.github.com/some/endpoint'
      payload = {'some': 'data'}
      req = requests.post(url, json=payload)
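
To see how these payload styles differ without contacting any server, one option is to build each request and inspect it before it is sent. The following is a minimal sketch, assuming the `requests.Request(...).prepare()` API; the helper name `prepared_post` is hypothetical and the httpbin URL is only a placeholder, since nothing is actually transmitted.

```python
import json
import requests

URL = 'http://httpbin.org/post'  # placeholder; nothing is actually sent


def prepared_post(**kwargs):
    """Build a POST request and return it in prepared (ready-to-send) form."""
    return requests.Request('POST', URL, **kwargs).prepare()


# A list of tuples and a dict with a list value produce the same form body.
form_from_tuples = prepared_post(data=[('key1', 'value1'), ('key1', 'value2')])
form_from_dict = prepared_post(data={'key1': ['value1', 'value2']})
print(form_from_tuples.body)

# A pre-serialised JSON string is sent as-is; no Content-Type is set for it.
raw_json = prepared_post(data=json.dumps({'some': 'data'}))
print(raw_json.headers.get('Content-Type'))

# json= serialises the payload and also sets the Content-Type header.
auto_json = prepared_post(json={'some': 'data'})
print(auto_json.headers.get('Content-Type'))
```

The practical difference between the two JSON options is the header: with `data=json.dumps(...)` you would normally add a `Content-Type` header yourself, whereas `json=` takes care of it.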
      
    • To pass headers
      get:
      
      headers = {'user-agent': 'my-app/0.0.1'}
      req = requests.get(url, headers=headers)
      post:
      
      header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
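      # xsrf is assumed to hold the _xsrf token extracted from the login page beforehand
      data = {'_xsrf': xsrf, 'email': 'your email', 'password': 'your password',
              'remember_me': True}
      session = requests.Session()
      result = session.post('https://www.zhihu.com/login/email', headers=header, data=data)  # result is a Response; result.text is a JSON string containing the login outcome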
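
When the same headers are needed on every request, they can also be set once on the session. The sketch below is an illustration under assumptions, not part of the original example: it uses `Session.headers` and `Session.prepare_request` to show the merged result without sending anything, and the header values are placeholders.

```python
import requests

session = requests.Session()
# Headers set on the session are merged into every request made through it.
session.headers.update({'User-Agent': 'my-app/0.0.1'})

# Per-request headers are merged on top of the session defaults.
request = requests.Request(
    'GET', 'http://httpbin.org/get', headers={'Accept-Language': 'zh-CN'}
)
prepared = session.prepare_request(request)

print(prepared.headers['User-Agent'])
print(prepared.headers['accept-language'])  # header lookup ignores case
```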
      
    • To pass cookies
      get:
      
      url = 'http://httpbin.org/cookies'
      req = requests.get(url, cookies=dict(cookies_are='working'))
      post:
      
      import requests
      r = requests.get(url1)  # the first URL you request
      headers = {
          'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
          'Accept-Encoding':'gzip, deflate, sdch',
          'Accept-Language':'zh-CN,zh;q=0.8',
          'Connection':'keep-alive',
          'Cache-Control':'no-cache',
          'Content-Length':'6',
          'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
          'Host':'www.mm131.com',
          'Pragma':'no-cache',
          'Origin':'http://www.mm131.com/xinggan/',
          'Upgrade-Insecure-Requests':'1',
          'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
          'X-Requested-With':'XMLHttpRequest'
      }  # example headers; copy the ones your own POST request sends
      # Build the Cookie header from the cookies set by the first response
      headers['Cookie'] = '; '.join('='.join(i) for i in r.cookies.items())
      r = requests.post(url2, headers=headers, data=data)  # the second URL you request
      
    • To upload a file
      post:
      
      # Basic version:
      url = 'http://httpbin.org/post'
      files = {'file': open('report.xls', 'rb')}
      
      req = requests.post(url, files=files)
      req.text
      {
        ...
        "files": {
          "file": "<censored...binary...data>"
        },
        ...
      }
      
      # Advanced version: set the filename, content type and extra headers explicitly
      url = 'http://httpbin.org/post'
      files = {'file': ('report.xls', open('report.xls', 'rb'), 'application/vnd.ms-excel', {'Expires': '0'})}
      
      req = requests.post(url, files=files)
      req.text
      {
        ...
        "files": {
          "file": "<censored...binary...data>"
        },
        ...
      }
      
    • A string can in fact be uploaded as a file too:
      url = 'http://httpbin.org/post'
      files = {'file': ('report.csv', 'some,data,to,send\nanother,row,to,send\n')}
      
      req = requests.post(url, files=files)
      req.text
      {
        ...
        "files": {
          "file": "some,data,to,send\\nanother,row,to,send\\n"
        },
        ...
      }
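
To look at what an upload actually contains, the request can be built and inspected locally. This is a minimal sketch assuming the `requests.Request(...).prepare()` API and reusing the string upload from above; it makes no network call.

```python
import requests

files = {'file': ('report.csv', 'some,data,to,send\nanother,row,to,send\n')}
prepared = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

# The body is multipart-encoded and the boundary is recorded in Content-Type.
print(prepared.headers['Content-Type'])
print(prepared.body.decode('utf-8'))
```

The printed body shows the field name, the filename and the file content between boundary markers, which is the form the server receives.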
      
  • A further aside: the function definitions of get and post, which give a fuller picture of the parameters:
    get:
    
    def get(url, params=None, **kwargs):
        r"""Sends a GET request.
    
        :param url: URL for the new :class:`Request` object.
        :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
        :param \*\*kwargs: Optional arguments that ``request`` takes.
        :return: :class:`Response <Response>` object
        :rtype: requests.Response
        """
    
        kwargs.setdefault('allow_redirects', True)
        return request('get', url, params=params, **kwargs)
    post:
    
    def post(url, data=None, json=None, **kwargs):
        r"""Sends a POST request.
    
        :param url: URL for the new :class:`Request` object.
        :param data: (optional) Dictionary (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
        :param json: (optional) json data to send in the body of the :class:`Request`.
        :param \*\*kwargs: Optional arguments that ``request`` takes.
        :return: :class:`Response <Response>` object
        :rtype: requests.Response
        """
    
        return request('post', url, data=data, json=json, **kwargs)
    
  • Another aside: a way to print the URL after the parameters have been added:
    print(req.url)
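
The same URL can be previewed before any request is sent. Here is a minimal sketch, assuming the `requests.Request(...).prepare()` API, that also shows how a list value is expanded into repeated keys:

```python
import requests

payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
prepared = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()

print(prepared.url)
# http://httpbin.org/get?key1=value1&key2=value2&key2=value3
```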
    
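  • Another thing to watch out for is encoding:
    • With print(req.text), requests decodes the response for you automatically (the raw body comes back as bytes, and with urllib you would have to decode it by hand). Changing the encoding is simple too: req.encoding = 'ISO-8859-1'
    • If you want the body as raw bytes instead (content is an attribute, not a method):
      req.content
      
    • And if you want the result parsed as JSON:
        req.json()
      
    • Be sure to handle exceptions! The response body may well not be valid JSON, or the request itself may have failed
      • If you want the raw, unprocessed response: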
        req = requests.get('https://api.github.com/events', stream=True)
        req.raw
        <urllib3.response.HTTPResponse object at 0x101194810>
        
        req.raw.read(10)
        '\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
        
      • In general, though, streamed content is better saved to a file chunk by chunk:
        with open(filename, 'wb') as fd:
            for chunk in req.iter_content(chunk_size=128):
                fd.write(chunk)
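
Returning to the warning about exceptions around req.json(): one possible way to guard both the request and the JSON decoding is sketched below. The function names `fetch_json` and `parse_json` and the 10-second timeout are assumptions made for illustration, not something the official documentation prescribes.

```python
import requests


def parse_json(resp):
    """Decode a Response body as JSON, returning None if it is not valid JSON."""
    try:
        return resp.json()
    except ValueError as exc:  # the JSON decoding errors subclass ValueError
        print('response is not valid JSON:', exc)
        return None


def fetch_json(url, timeout=10):
    """Return the decoded JSON body of `url`, or None if anything goes wrong."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.exceptions.RequestException as exc:
        print('request failed:', exc)
        return None
    return parse_json(resp)
```

A call such as `fetch_json('https://api.github.com/events')` then yields either the parsed data or `None`, so the caller only has to check for `None`.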
        
  4. To read information from the response, such as its headers:

    req.headers
    {
        'content-encoding': 'gzip',
        'transfer-encoding': 'chunked',
        'connection': 'close',
        'server': 'nginx/1.0.4',
        'x-runtime': '148ms',
        'etag': '"e1ca502697e5c9317743dc078f67693f"',
        'content-type': 'application/json'
    }
    
    req.headers['Content-Type']
    'application/json'
    
    req.headers.get('content-type')
    'application/json'
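
Both lookups above succeed because response headers are stored in a case-insensitive dictionary. A minimal sketch of that behaviour, using `requests.structures.CaseInsensitiveDict` directly with a made-up header set:

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({'content-type': 'application/json'})

print(headers['Content-Type'])      # any capitalisation finds the same entry
print(headers.get('CONTENT-TYPE'))
print(headers.get('x-missing'))     # a missing header gives None with .get()
```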
    
  5. How to read cookies and send them:

    # Reading a cookie
    >>> url = 'http://example.com/some/cookie/setting/url'
    >>> r = requests.get(url)
    
    >>> r.cookies['example_cookie_name']
    'example_cookie_value'
    # Sending cookies
    >>> url = 'http://httpbin.org/cookies'
    >>> cookies = dict(cookies_are='working')
    
    >>> r = requests.get(url, cookies=cookies)
    >>> r.text
    '{"cookies": {"cookies_are": "working"}}'
    
    
    # Using a RequestsCookieJar to do both
    >>> jar = requests.cookies.RequestsCookieJar()
    >>> jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
    >>> jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
    >>> url = 'http://httpbin.org/cookies'
    >>> r = requests.get(url, cookies=jar)
    >>> r.text
    '{"cookies": {"tasty_cookie": "yum"}}'
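
Only tasty_cookie comes back because a jar sends a cookie only when its domain and path match the requested URL. The sketch below, which assumes the `requests.Request(...).prepare()` API and sends nothing, shows the Cookie header that would be generated for the same jar:

```python
import requests

url = 'http://httpbin.org/cookies'  # same URL as above; nothing is sent

jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')

# Preparing the request selects only the cookies that match the URL.
prepared = requests.Request('GET', url, cookies=jar).prepare()
cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```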
    

  6. Other topics (to be filled in later):

  • Status codes
  • Timeouts
  • Exception and error handling