2.1 Understanding HTTP Requests
2.1.1 What an HTTP Request Is
2.1.2 HTTP Request Information
1. Request methods
2. Request headers (a raw request is sketched below)
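For reference, an HTTP request is just structured text: a request line giving the method, path and protocol version, followed by header fields. A rough sketch of the GET request a browser might send for a Douban search (header values abbreviated):

GET /search?q=python HTTP/1.1
Host: www.douban.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) ...
Accept: text/html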
2.2 Crawler Basics: Getting Started with the Requests Library
2.2.1 Installing the Requests Library
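Requests is a third-party library, so it has to be installed before use; the usual way is with pip:

pip install requests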
2.2.2 Request Methods of the Requests Library
import requests
# GET: fetch a page
response = requests.get('https://www.douban.com/')
# POST: submit data
requests.post('https://www.douban.com/')
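Besides the bare call above, requests.post() can carry form data through its data parameter; a minimal sketch (the field name is only illustrative):

import requests
# The data dict is sent form-encoded in the request body; 'q' is just an example field
response = requests.post('https://www.douban.com/', data={'q': 'python'})
print(response.status_code)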
2.2.3 The Response Object of the Requests Library
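The object returned by requests.get() bundles everything about the response; a quick sketch of the attributes used most often:

import requests

r = requests.get('https://www.douban.com/')
print(r.status_code)  # numeric status code, e.g. 200 or 418
print(r.encoding)     # encoding Requests guessed for the body
print(r.headers)      # response headers (dict-like)
print(r.text)         # body decoded to a string
print(r.content)      # raw body as bytes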
2.2.4 Response Status Codes
418: the request was identified as a crawler and rejected (anti-crawler measure)
200: the request succeeded (normal access)
import requests
url = 'https://www.douban.com/search'
r = requests.get(url)
# Status code of the response
code = r.status_code
print(code)
The request above was made without custom headers, so the site's anti-crawler mechanism blocked it and returned 418.
2.2.5 Customizing Request Headers
import requests
# Custom headers; the User-Agent string makes the request look like it comes from a real browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
# Target URL
url = 'https://www.douban.com/search'
# GET request with the custom headers, so it is no longer rejected
r = requests.get(url, headers=headers)
2.2.6 Redirects and Timeouts
# timeout=3: raise a Timeout exception if the server does not respond within 3 seconds
r = requests.get(url, headers=headers, timeout=3)
# r.history lists the responses for any redirects followed on the way to the final page
print(r.history)
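The timeout is raised as requests.exceptions.Timeout, so it can be caught explicitly, and automatic redirect following can be switched off with allow_redirects. A minimal sketch reusing the headers and url defined above:

import requests

try:
    # allow_redirects=False stops Requests from following 3xx responses automatically
    r = requests.get(url, headers=headers, timeout=3, allow_redirects=False)
    print(r.status_code)
    print(r.history)  # empty list when no redirect was followed
except requests.exceptions.Timeout:
    print('No response within 3 seconds')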
2.2.7 Passing URL Parameters
import requests
# Custom headers; the User-Agent string identifies a real browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
# Target URL
url = 'https://www.douban.com/search'
# Query parameters to append to the URL
payload = {'q': 'python', 'cat': '1001'}
# GET request; timeout=3 raises a Timeout exception if there is no response within 3 seconds
r = requests.get(url, headers=headers, timeout=3, params=payload)
# Final URL with the encoded query string, e.g. https://www.douban.com/search?q=python&cat=1001
url = r.url
print(url)
2.2.7.1 Changing the cat Parameter
1. Search everything: omit the cat parameter
import requests
# Custom headers; the User-Agent string identifies a real browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
# Target URL
url = 'https://www.douban.com/search'
# Only the search keyword, no cat parameter
payload = {'q': 'python'}
# GET request; timeout=3 raises a Timeout exception if there is no response within 3 seconds
r = requests.get(url, headers=headers, timeout=3, params=payload)
url = r.url
print(url)
2. Search images: cat=1025
import requests
# Custom headers; the User-Agent string identifies a real browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
# Target URL
url = 'https://www.douban.com/search'
# Search keyword plus cat=1025 to restrict the search to images
payload = {'q': 'python', 'cat': '1025'}
# GET request; timeout=3 raises a Timeout exception if there is no response within 3 seconds
r = requests.get(url, headers=headers, timeout=3, params=payload)
url = r.url
print(url)
2.3 Crawler Basics: Urllib Library Fundamentals
2.3.1 Introduction to the Urllib Library
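Urllib ships with the Python standard library, so nothing needs to be installed. It is split into several submodules: urllib.request for sending requests, urllib.parse for URL handling, urllib.error for the exceptions raised by urllib.request, and urllib.robotparser for parsing robots.txt files.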
2.3.2 Sending a GET Request
Without custom request headers, the request is blocked by the anti-crawler mechanism, just as with Requests.
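A minimal sketch of a plain GET with urllib; against Douban it will typically fail with an HTTPError (such as 418), because no browser-like headers are sent:

from urllib import request, error

try:
    response = request.urlopen('https://www.douban.com/', timeout=3)
    print(response.status)
except error.HTTPError as e:
    # Without a browser User-Agent the site rejects the request
    print('Blocked:', e.code)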
2.3.3 Mimicking a Browser to Send a GET Request
As with the Requests library, custom headers have to be defined for the request to go through; see the sketch below.
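A sketch of the same request built with a Request object that carries the browser User-Agent, the urllib equivalent of passing headers= to Requests:

from urllib import request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
req = request.Request('https://www.douban.com/search?q=python', headers=headers)
response = request.urlopen(req, timeout=3)
print(response.status)                   # 200 once the request looks like a browser
html = response.read().decode('utf-8')   # response body as text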
2.3.4 Sending a POST Request
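urlopen() sends a POST whenever the data argument is supplied; the body has to be URL-encoded and converted to bytes first. A minimal sketch (the target URL and form field are only illustrative):

from urllib import request, parse

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
# Encode the form fields and convert to bytes; providing data turns the request into a POST
data = parse.urlencode({'q': 'python'}).encode('utf-8')
req = request.Request('https://www.douban.com/search', data=data, headers=headers)
response = request.urlopen(req, timeout=3)
print(response.status)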
2.3.5 URL Parsing
1. urlparse: splitting a URL
from urllib.parse import urlparse
# 1. urlparse: split a URL into its six components
result = urlparse('https://www.douban.com/search?cat=1001&q=python')
print(result)
2. urlunparse: assembling a URL
# 2. urlunparse: assemble a URL from its six components
from urllib.parse import urlunparse
data = ['https', 'www.douban.com', '/search', '', 'cat=1001&q=python', '']
print(urlunparse(data))
3. urljoin: joining two URLs
# 3. urljoin: join a base URL and a relative path
from urllib.parse import urljoin
result = urljoin('https://www.douban.com', 'accounts/login')
print(result)
4. Complete code
# -*- coding: utf-8 -*-
from urllib.parse import urlparse
# 1. urlparse: split a URL into its six components
result = urlparse('https://www.douban.com/search?cat=1001&q=python')
print(result)
# ParseResult(scheme='https',
#             netloc='www.douban.com',
#             path='/search',
#             params='',
#             query='cat=1001&q=python',
#             fragment='')
# 2. urlunparse: assemble a URL from its six components
from urllib.parse import urlunparse
data = ['https', 'www.douban.com', '/search', '', 'cat=1001&q=python', '']
print(urlunparse(data))
# https://www.douban.com/search?cat=1001&q=python
# 3. urljoin: join a base URL and a relative path
from urllib.parse import urljoin
result = urljoin('https://www.douban.com', 'accounts/login')
print(result)
# https://www.douban.com/accounts/login