urllib.request

I. Definition

urllib is Python's built-in HTTP request library.

Official documentation: https://docs.python.org/3/library/urllib.html.

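II. The four urllib modules

1 The request module

request is the most basic HTTP request module and can be used to simulate sending a request. Just as when you type a URL into a browser and press Enter, you only need to pass the URL and any extra parameters to the library function to reproduce that process.

2 The error module

The exception-handling module. If a request fails, we can catch the exception and then retry or take some other action so that the program does not terminate unexpectedly.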
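The catch-and-retry pattern described above could look roughly like the sketch below. The helper name fetch, the fixed retry count, and the 5-second timeout are assumptions made for illustration, not part of the original notes.

```python
import urllib.error
import urllib.request


def fetch(url, retries=2, timeout=5):
    """Return the response body as bytes, or None if every attempt fails.

    Hypothetical helper: the name and the simple retry policy are illustrative.
    """
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.HTTPError as e:
            # HTTPError is a subclass of URLError, so it has to be caught first.
            print(f"attempt {attempt}: server answered with HTTP {e.code}")
        except urllib.error.URLError as e:
            # Raised when the server cannot be reached or the URL type is unknown.
            print(f"attempt {attempt}: request failed: {e.reason}")
    return None
```

Calling fetch('http://httpbin.org/get') would return the body on success; on repeated failure it returns None instead of letting the exception end the program.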

3 The parse module

A utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
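As a small illustration of splitting, parsing, and joining, the sketch below uses a made-up example.com URL; only the urllib.parse functions themselves come from the standard library.

```python
from urllib.parse import parse_qs, urljoin, urlparse, urlunparse

# Illustrative URL, not taken from the original notes.
url = "http://www.example.com/docs/index.html?id=5&lang=en#comment"

# Split the URL into scheme, netloc, path, params, query, fragment.
parts = urlparse(url)

# Parse the query string into a dict of lists.
query = parse_qs(parts.query)

# Put the six components back together.
rebuilt = urlunparse(parts)

# Resolve a relative link against a base URL.
joined = urljoin("http://www.example.com/docs/index.html", "../about.html")

print(parts.scheme, parts.netloc, parts.path)
print(query)
print(rebuilt)
print(joined)
```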

4 The robotparser module

Mainly used to read a site's robots.txt file and decide which pages may be crawled and which may not. In practice it is used fairly rarely.
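A minimal sketch of that check follows. To stay offline it feeds the parser an inline robots.txt instead of downloading one; the rules and the example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; a real crawler would call set_url() and read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(useragent, url) reports whether the rules permit crawling the URL.
allowed = parser.can_fetch("*", "http://www.example.com/index.html")
blocked = parser.can_fetch("*", "http://www.example.com/private/data.html")

print(allowed, blocked)
```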

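III. Sending requests

1. Sending a request with urlopen()

response.read()

Returns the page source as bytes.

type(response)

Returns the type of the response object.

response.getheaders()

Returns the response headers as a list of (name, value) tuples.

response.getheader("Server")

Given the argument "Server", returns only the value of the Server header.

bytes(urllib.parse.urlencode({'key': 'value'}), encoding='utf-8')

Converts data from string form to bytes form.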
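To try the response methods listed above without depending on an external site, the sketch below starts a throwaway HTTP server on localhost and queries it with urlopen(). HelloHandler and demo are hypothetical names, and the server exists only to make the example self-contained.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical test server that answers every GET with the text 'hello'."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def demo():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        with urllib.request.urlopen(url, timeout=5) as response:
            return {
                "type": type(response).__name__,
                "status": response.status,
                "headers": response.getheaders(),
                "server": response.getheader("Server"),
                "body": response.read().decode("utf-8"),
            }
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() returns a dict with the response type name, the status code, the full header list, the Server header on its own, and the decoded body.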

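Parameters

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

1 The data parameter

data is optional. If it is supplied, it must be of type bytes, so content in any other form has to be converted first, for example with bytes(). In addition, once this parameter is passed, the request method is no longer GET but POST.
Here is an example: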

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

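Here we pass one parameter, word, with the value hello, and it has to be encoded as bytes. The conversion uses bytes(), whose first argument must be a str, so urlencode() from the urllib.parse module is used first to turn the parameter dict into a string. The chain is: dict -> str -> bytes, that is, urlencode() followed by bytes().

The second argument specifies the encoding, here utf-8.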
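The conversion chain can be checked offline, as in the sketch below. Nothing is sent over the network: the Request object is only built and inspected, and the variable names are illustrative.

```python
import urllib.parse
import urllib.request

params = {"word": "hello"}

# Step 1: dict -> str
query_string = urllib.parse.urlencode(params)

# Step 2: str -> bytes
data = bytes(query_string, encoding="utf-8")

# Attaching data switches the default method from GET to POST.
request = urllib.request.Request("http://httpbin.org/post", data=data)

print(query_string)
print(data)
print(request.get_method())
```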

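2 timeout

The timeout is given in seconds. If no response arrives within that time, an exception is raised.

Code for handling a timeout:

import socket
from urllib import request, error

url = 'http://www.baidu.com'

try:
    response = request.urlopen(url, timeout=1)
    print(response.read().decode('utf-8'))
except error.URLError as e:
    # A timeout while connecting is wrapped in URLError; e.reason holds the cause
    if isinstance(e.reason, socket.timeout):
        print('timeout')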
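One caveat with the handler above: a timeout that happens while waiting for the response, after the connection has been made, can surface as a bare socket.timeout instead of a URLError. The sketch below handles both cases; fetch_or_timeout is a hypothetical helper name, and returning None on timeout is an arbitrary choice for illustration.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_timeout(url, timeout=1.0):
    """Return the body as bytes, or None if the request timed out.

    Hypothetical helper; other errors are re-raised unchanged.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeout while connecting: wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # Timeout while waiting for or reading the response: not wrapped.
        return None
```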

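3 cafile and capath

cafile specifies a CA certificate file and capath a directory of CA certificate files; both are useful when requesting HTTPS links. They have been deprecated since Python 3.6 and were removed in Python 3.13, where the context parameter (an ssl.SSLContext) is used instead.

Official documentation: https://docs.python.org/3/library/urllib.request.html.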
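Since cafile and capath are no longer accepted by newer Python versions, one way to get the same effect is to build an ssl.SSLContext and pass it through the context parameter. The sketch below assumes two hypothetical helpers, make_context and fetch_https; any certificate paths passed to them would be your own.

```python
import ssl
import urllib.request


def make_context(cafile=None, capath=None):
    """Build an SSLContext that verifies servers against the given CA data.

    With both arguments left as None, the system's default CA certificates
    are loaded. Hypothetical helper for illustration.
    """
    return ssl.create_default_context(cafile=cafile, capath=capath)


def fetch_https(url, cafile=None, capath=None, timeout=5):
    """Fetch an HTTPS URL using a context built from cafile/capath."""
    context = make_context(cafile, capath)
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read()
```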

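2. Building a request with the Request class

1 Request allows custom headers

urlopen() can issue the most basic requests, but its few simple parameters are not enough to build a complete request. If the request needs headers or other information, use the more powerful Request class to build it.

In effect, Request() packages the request (URL, headers, data) into a data structure, a Request object; the request itself is still sent by urlopen().

Building this data structure lets us treat the request as a standalone object and configure its parameters more richly and flexibly.

2. Parameters

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)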

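(1) url: required

(2) data

Optional. If passed, it must be of type bytes. If it is a dict, encode it first with urlencode() from the urllib.parse module and then convert the resulting string to bytes.

(3) headers

A dict. There are two ways to add request headers:

Pass them directly through the headers parameter when constructing the request

Call the add_header() method on the request instance

Code example:

# Set headers through the Request() constructor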

import urllib.request
import urllib.parse

url='http://httpbin.org/post'

headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}

data={'key':'value'}

data=bytes(urllib.parse.urlencode(data),encoding='utf-8')

request=urllib.request.Request(url=url,headers=headers,data=data)

response=urllib.request.urlopen(request)

print(response.read().decode('utf-8'))

Adding a header with the Request object's add_header() method:

# Using add_header
# request.add_header(field_name, field_value)
import urllib.request
import urllib.parse
url='http://www.baidu.com'

request=urllib.request.Request(url=url)

request.add_header('user-agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36')

r=urllib.request.urlopen(request)

print(r.read().decode('utf-8'))
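Both ways of setting headers can be compared offline by inspecting the Request objects, as sketched below. The User-Agent string is shortened for readability and nothing is sent over the network. Note that Request stores header names capitalized, so 'user-agent' is looked up as 'User-agent'.

```python
import urllib.request

# Shortened, illustrative User-Agent value.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
URL = "http://httpbin.org/post"

# Way 1: pass headers to the constructor.
via_constructor = urllib.request.Request(URL, headers={"user-agent": USER_AGENT})

# Way 2: add the header afterwards.
via_method = urllib.request.Request(URL)
via_method.add_header("user-agent", USER_AGENT)

# Header names are stored via str.capitalize(), hence 'User-agent'.
print(via_constructor.header_items())
print(via_method.get_header("User-agent"))
```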

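(4) origin_req_host

The host name or IP address of the party making the request.

If the request headers are poorly disguised, the original IP address may be traceable through this field.

(5) unverifiable

Indicates whether the request is unverifiable; the default is False. An unverifiable request is one whose result the user had no option to approve. For example, if we request an image embedded in an HTML document but the user had no option to approve fetching that image automatically, unverifiable is True.

(6) method

A string naming the HTTP method to use, such as GET, POST, or PUT.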
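The effect of the method parameter can be seen without sending anything by calling get_method() on the Request object. The sketch below reuses the httpbin.org URL from the earlier examples purely as a placeholder.

```python
import urllib.parse
import urllib.request

URL = "http://httpbin.org/post"  # placeholder; no request is actually sent
data = urllib.parse.urlencode({"key": "value"}).encode("utf-8")

# Without method and without data, the request defaults to GET.
default_request = urllib.request.Request(URL)

# An explicit method overrides the default, with or without data.
put_request = urllib.request.Request(URL, data=data, method="PUT")
head_request = urllib.request.Request(URL, method="HEAD")

methods = [r.get_method() for r in (default_request, put_request, head_request)]
print(methods)
```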

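3 Authentication, cookies, and proxies

Introduction to Handlers

In short, Handlers can be thought of as processors of various kinds: some handle login authentication, some handle cookies, some handle proxy settings. With them we can do almost everything an HTTP request involves.

(1) Authentication (building the credentials sent with the request)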

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

# Create the password manager and register the credentials for this URL
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
# Instantiate HTTPBasicAuthHandler with the password manager
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)
# Summary: HTTPPasswordMgrWithDefaultRealm() -> add_password()
# -> HTTPBasicAuthHandler(p) -> build_opener(auth_handler) -> opener.open(url)
try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
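Passing None as the realm in add_password() registers the credentials as a default for any realm the server announces. The sketch below checks this offline with find_user_password(); the realm name and the extra URLs are made up for illustration.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

manager = HTTPPasswordMgrWithDefaultRealm()
manager.add_password(None, "http://localhost:5000/", "username", "password")

# Any realm name falls back to the default entry for a matching URL.
exact = manager.find_user_password("Some Realm", "http://localhost:5000/")

# URLs below the registered one match as well.
nested = manager.find_user_password("Some Realm", "http://localhost:5000/admin/page")

# A different host:port has no credentials registered.
other = manager.find_user_password("Some Realm", "http://localhost:8000/")

print(exact, nested, other)
```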

(2) Proxies

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    url = 'http://www.baidu.com'
    response = opener.open(url)
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

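"[WinError 10061] No connection could be made because the target machine actively refused it" appears for this reason:
no HTTP proxy service is running on local port 9743, that is, nothing is serving as a proxy at 127.0.0.1:9743, so the request fails.

Solution:
Find a working free proxy IP, for example on Xici, and use it instead.

Xici proxy list: https://www.xicidaili.com/nn/

Source: https://blog.csdn.net/qq_42908549/article/details/86706161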
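A "connection refused" error like the one above can be anticipated by checking whether anything is listening on the proxy address before building the opener. The sketch below is one simple way to do that; is_port_open is a hypothetical helper, and it only tests TCP reachability, not whether the service is really an HTTP proxy.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper: it confirms that something is listening, not that
    the listener speaks the HTTP proxy protocol.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

For the example in this section, is_port_open('127.0.0.1', 9743) returning False would explain the refused connection.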

(3) Cookies

Too many classes are involved; the requests library offers more convenient functions for this.
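For reference, a minimal standard-library sketch is shown below, using http.cookiejar.CookieJar together with HTTPCookieProcessor. To stay self-contained it talks to a throwaway local server; CookieHandler, demo, and the cookie name session are hypothetical and exist only for this illustration.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: sets a cookie and echoes the Cookie header."""

    def do_GET(self):
        body = (self.headers.get("Cookie") or "").encode("utf-8")
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def demo():
    server = HTTPServer(("127.0.0.1", 0), CookieHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
        # First request: the server sets the cookie and the jar stores it.
        with opener.open(url, timeout=5) as first:
            first.read()
        # Second request: the opener sends the stored cookie back.
        with opener.open(url, timeout=5) as second:
            echoed = second.read().decode("utf-8")
        return {cookie.name: cookie.value for cookie in jar}, echoed
    finally:
        server.shutdown()
        server.server_close()
```

demo() returns the cookies held in the jar and the Cookie header the server saw on the second request.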
