Python Study Notes 6: Web Data Scraping and Analysis

Study notes on the book 《毫无障碍学Python》 by 邓文渊

Web data scraping and analysis

1. URL parsing

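The urlparse function in Python's urllib.parse module (the standalone urlparse module in Python 2) parses a URL. It returns a ParseResult object, which is a named tuple; its attributes give the individual components of the URL.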

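ParseResult attributes:

Attribute	Index	Value	Value if not present
scheme	0	URL scheme (the protocol)	empty string
netloc	1	Network location (the site name, plus the port if given)	empty string
path	2	Hierarchical path	empty string
params	3	Parameters of the last path element (the part after ";")	empty string
query	4	Query string, i.e. the GET parameters	empty string
fragment	5	Fragment identifier (the part after "#")	empty string
port	(none)	Port number as an integer	None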

Parsing the URL of the Nanchang weather page on the China Weather site (中国天气网): http://www.weather.com.cn/weather1d/101240101.shtml#input

from urllib.parse import urlparse
url = 'http://www.weather.com.cn/weather1d/101240101.shtml#input'
o = urlparse(url)
print(o)

print("scheme={}".format(o.scheme))     # http
print("netloc={}".format(o.netloc))     # www.weather.com.cn
print("port={}".format(o.port))         # None
print("path={}".format(o.path))         # /weather1d/101240101.shtml
print("query={}".format(o.query))       # empty string

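Output:

ParseResult(scheme='http', netloc='www.weather.com.cn', path='/weather1d/101240101.shtml', params='', query='', fragment='input')
scheme=http
netloc=www.weather.com.cn
port=None
path=/weather1d/101240101.shtml
query=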
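
The weather URL has no port, parameters, or query string, so several attributes come back empty. As a small sketch, the same call on a made-up URL fills every field. The host example.com, the port, and the query values are assumptions chosen for illustration and do not come from the book. parse_qs, also in urllib.parse, turns the query string into a dictionary.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL, chosen only so that every component is present
url = 'http://example.com:8080/search;type=a?q=python&page=2#results'
parts = urlparse(url)

print(parts.scheme)    # http
print(parts.netloc)    # example.com:8080
print(parts.hostname)  # example.com
print(parts.port)      # 8080 (an int)
print(parts.path)      # /search
print(parts.params)    # type=a
print(parts.query)     # q=python&page=2
print(parts.fragment)  # results

# ParseResult is a named tuple, so the first six fields also work by index
print(parts[0] == parts.scheme)  # True

# parse_qs maps each query key to a list of its values
query = parse_qs(parts.query)
print(query)  # {'q': ['python'], 'page': ['2']}
```

Note that netloc keeps the port, while hostname and port give the two pieces separately.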

2. Fetching web page data

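requests is used to fetch the source code of web pages. Because it is easier to use than the built-in urllib module, it has gradually replaced urllib. Once the source has been fetched, the data you need can be located with the in operator or with regular expressions.
After importing requests, call requests.get() to send an HTTP GET request to the remote server. When the server receives the request, it sends back a response containing the page content (the source code). Once the correct encoding has been set, the source is available through the text attribute.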

(1) Getting the page source

import requests
url = 'http://music.taihe.com/'    # using the Qianqian Music site (千千音乐) as an example
html = requests.get(url)    # type(html) is <class 'requests.models.Response'>
#html.encoding = 'utf-8'    # decode the response as UTF-8
print(html.text)    # html.text holds the page source

# Once you have the source you can process it, e.g. split it into a list of lines with the line breaks removed
htmllist = html.text.splitlines()
for row in htmllist:
    print(row)
print(len(htmllist))
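
Why the encoding setting matters can be shown without touching the network. The following is a sketch that uses only built-in string methods; the sample bytes stand in for a response body and are not taken from any real site. requests keeps the raw bytes in html.content and decodes them with html.encoding to produce html.text, so a wrong encoding yields garbled text.

```python
# Stand-in for html.content: UTF-8 bytes as a server might send them
raw = '<title>千千音乐</title>'.encode('utf-8')

right = raw.decode('utf-8')       # what text looks like with the correct encoding
wrong = raw.decode('iso-8859-1')  # what it looks like with a wrong single-byte encoding

print(right)                                  # <title>千千音乐</title>
print(wrong == right)                         # False: the Chinese characters are garbled
print('千千音乐' in right, '千千音乐' in wrong)  # True False
```

If the printed page source looks garbled, setting html.encoding explicitly before reading html.text is the usual remedy. requests also offers html.apparent_encoding, a guess based on the bytes themselves, which may help when the right encoding is unknown.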

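(2) Searching for a specified string

The source returned by the text attribute is one long string. To check whether it contains a given character or substring, use the in operator.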

import requests
url = 'http://www.9ku.com/qingyinyue/'  # using the 9ku music site (九酷音乐网) as an example
html = requests.get(url)
html.encoding = 'utf-8'
if "流水" in html.text:
    print("Found!")
# You can also search line by line and count the lines that contain the string
htmllist = html.text.splitlines()
n = 0
for row in htmllist:
    if "流水" in row:
        n += 1
print('Found on {} lines'.format(n))
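
The loop above counts lines that contain the string, which is not the same as the number of occurrences when a line contains it more than once. A minimal sketch of the difference, using a made-up three-line text in place of html.text:

```python
# Made-up page text standing in for html.text
text = "高山流水\n流水,流水\n清音乐"

# Number of lines that contain the substring (what the line-by-line loop counts)
line_hits = sum(1 for row in text.splitlines() if "流水" in row)

# Number of non-overlapping occurrences in the whole text
total_hits = text.count("流水")

print(line_hits)   # 2
print(total_hits)  # 3
```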


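Regular expressions (regex for short) are patterns that describe sets of strings. They are used to extract more complex content from a site, such as hyperlinks or phone numbers.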

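Common regular expression syntax
.	Any single character except a newline (\n)
^	Start of the input line
$	End of the input line
*	The preceding item may occur zero or more times
+	The preceding item may occur one or more times
?	The preceding item may occur zero times or once
[abc]	Any one of the characters a, b, or c
[a-z]	Any one character in the range a to z
\	Treats the following special character as a literal character
{m}	The preceding item must occur exactly m times
{m,}	The preceding item must occur at least m times, with no upper limit
{m,n}	The preceding item must occur at least m times and at most n times
\d	A digit; for ASCII text, equivalent to [0123456789] or [0-9]
[^...]	Negation inside brackets, e.g. [^a-d] matches any character except a, b, c, and d
\D	A non-digit character, i.e. [^0-9]
\n	Newline (line feed)
\r	Carriage return
\t	Tab
\s	A whitespace character, i.e. [ \t\n\r\f\v]
\S	A non-whitespace character, i.e. [^ \t\n\r\f\v]
\w	A digit, letter, or underscore; for ASCII text, equivalent to [0-9a-zA-Z_]
\W	Any character that is not a digit, letter, or underscore, i.e. [^\w] or [^0-9a-zA-Z_]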
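
A few of these symbols can be tried out directly with re.findall. The sample text below, including the phone number and postal code, is made up for illustration.

```python
import re

# Made-up sample text
text = "Tel: 0791-1234567, zip 330000\tend"

digits = re.findall(r'\d+', text)           # runs of one or more digits
phone = re.findall(r'\d{4}-\d{7}', text)    # exactly 4 digits, a hyphen, exactly 7 digits
lowercase = re.findall(r'[a-z]+', text)     # runs of lowercase letters
first_word = re.findall(r'^\w+', text)      # word characters at the start of the text
spaces = re.findall(r'\s', text)            # each whitespace character
colors = re.findall(r'colou?r', 'color colour')  # the "u" is optional

print(digits)      # ['0791', '1234567', '330000']
print(phone)       # ['0791-1234567']
print(lowercase)   # ['el', 'zip', 'end']
print(first_word)  # ['Tel']
print(spaces)      # [' ', ' ', ' ', '\t']
print(colors)      # ['color', 'colour']
```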

Common regular expression examples

Type	Regular expression	Example
Integer	[0-9]+	12356
Real number with a decimal point	[0-9]+\.[0-9]+	45.26
English word	[A-Za-z]+	Python
Variable name	[A-Za-z_][A-Za-z0-9_]*	_point
Email	[a-zA-Z0-9]+@[a-zA-Z0-9._]+	[email protected]
URL	http://[a-zA-Z0-9./_-]+	http://e-happy.com.tw/
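
One way to check patterns like these is re.fullmatch, which succeeds only when the whole sample matches. The patterns in this sketch escape the dot in the real-number pattern, put * after the second bracket of the variable-name pattern, and include - in the URL character class, so that each one matches its entire sample. The dictionary names are illustrative.

```python
import re

patterns = {
    'integer':  r'[0-9]+',
    'real':     r'[0-9]+\.[0-9]+',
    'word':     r'[A-Za-z]+',
    'variable': r'[A-Za-z_][A-Za-z0-9_]*',
    'url':      r'http://[a-zA-Z0-9./_-]+',
}
samples = {
    'integer':  '12356',
    'real':     '45.26',
    'word':     'Python',
    'variable': '_point',
    'url':      'http://e-happy.com.tw/',
}

# True where the pattern matches the entire sample string
results = {name: re.fullmatch(pat, samples[name]) is not None
           for name, pat in patterns.items()}
print(results)

# An unescaped dot matches any character, so it also accepts '45x26'
loose = re.fullmatch(r'[0-9]+.[0-9]+', '45x26') is not None
strict = re.fullmatch(r'[0-9]+\.[0-9]+', '45x26') is not None
print(loose, strict)  # True False
```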

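Creating a regular expression object: to use regular expressions, first import the re package, then create a regular expression object with the compile function it provides.
Syntax: import re
        pat = re.compile('[a-z]+')
Methods of a regular expression object:
match(string)  Tries to match the pattern at the start of string and stops at the first character that does not match. The result is stored in a match object; if the start of the string does not match, None is returned.
Methods of a match object:

group()  Returns the matched string
start()  Returns the start position of the match
end()  Returns the end position of the match
span()  Returns a tuple of (start position, end position)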

import re
pat = re.compile('[a-z]+')
m = pat.match('tem12po')    # store the match object in m for convenience
print(m)            # <re.Match object; span=(0, 3), match='tem'>
print(m.group())    # tem
print(m.start())    # 0
print(m.end())      # 3
print(m.span())     # (0, 3)
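
Because match() only tries at the start of the string, it returns None as soon as the first character fails, and calling group() on None raises an error. A short sketch of the guard, with fullmatch() added for comparison (fullmatch is part of the standard re module but is not covered in the notes above):

```python
import re

pat = re.compile('[a-z]+')

# match() only tries at position 0, so a leading digit makes it fail
m = pat.match('3tem12po')
print(m)  # None

# Guard before calling group(); None has no group() method
word = m.group() if m is not None else ''
print(repr(word))  # ''

# fullmatch() succeeds only if the entire string matches the pattern
partial = pat.fullmatch('tem12po')
whole = pat.fullmatch('tempo')
print(partial)        # None
print(whole.group())  # tempo
```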

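## re.match() function: lets you skip the separate step of creating a regular expression object with re.compile()
import re
m = re.match(r'[a-z]+', 'tem12po')   # two arguments: the pattern, conventionally written as a raw string (r prefix) so that backslashes are not treated as string escapes, and the string to search
print(m)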

search(string)  Finds the first substring of string that matches the regular expression and stops there. The result is stored in a match object; if nothing matches, None is returned.

import re
pat = re.compile('[a-z]+')
m = pat.search('3tem12po')
print(m)    # <re.Match object; span=(1, 4), match='tem'>
if m is not None:
    print(m.group())    # tem
    print(m.start())    # 1
    print(m.end())      # 4
    print(m.span())     # (1, 4)

## findall(string)  returns a list of all substrings of string that match the regular expression
import re
pat = re.compile('[a-z]+')
m = pat.findall('tem12po')
print(m)        # ['tem', 'po']
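
findall() returns plain strings, so the positions are lost. When positions are needed, finditer() from the same module yields one match object per hit. The sketch below also shows a detail that often surprises: when the pattern contains a capturing group, findall() returns the group rather than the whole match.

```python
import re

pat = re.compile('[a-z]+')

# finditer() yields match objects, so group() and span() are both available
hits = [(m.group(), m.span()) for m in pat.finditer('3tem12po')]
print(hits)  # [('tem', (1, 4)), ('po', (6, 8))]

# With one capturing group, findall() returns the group, not the whole match
before_digit = re.findall(r'([a-z]+)\d', '3tem12po')
print(before_digit)  # ['tem']
```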

Example: extracting all email addresses from the Wanshui Shuyuan (万水书苑) website

import re
import requests

regex = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+')   # build the regular expression object
url = 'http://www.wsbookshow.com/'
html = requests.get(url)        # fetch the source of http://www.wsbookshow.com/
emails = regex.findall(html.text)   # find all email addresses in html.text
for email in emails:
    print(email)
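
The same extraction can be tried offline on a literal string. In this sketch the HTML snippet and the addresses are made up, and the duplicate removal at the end is an addition for illustration rather than part of the book's example. An address often appears twice on a page, once in a mailto: link and once as visible text.

```python
import re

regex = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+')

# Made-up HTML standing in for html.text; the addresses are placeholders
page = '''
<a href="mailto:service@example.com">service@example.com</a>
<p>Sales: sales@example.org</p>
'''

emails = regex.findall(page)
print(emails)
# ['service@example.com', 'service@example.com', 'sales@example.org']

# dict.fromkeys keeps the first occurrence of each address, in order
unique_emails = list(dict.fromkeys(emails))
print(unique_emails)  # ['service@example.com', 'sales@example.org']
```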

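(3) Web page analysis

Web page analysis: an HTML page is made up of many tags, each written inside angle brackets <>.

Scraping and parsing pages with BeautifulSoup
After importing BeautifulSoup, fetch the page source with the get function of the requests package, then parse it with Python's built-in html.parser parser. The parsed result is returned as a BeautifulSoup object, here named sp. Syntax: sp = BeautifulSoup(source, 'html.parser')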

import requests
from bs4 import BeautifulSoup

url = 'http://www.baidu.com'
html = requests.get(url)
html.encoding = 'utf-8'
sp = BeautifulSoup(html.text, 'html.parser')

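Common BeautifulSoup attributes and methods (assuming a BeautifulSoup object sp has already been created):

Attribute or method	Description
title	Returns the page's title tag
text	Returns the page content with all HTML tags removed
find()	Returns the first tag that matches, e.g. sp.find("a")
find_all()	Returns all tags that match, e.g. sp.find_all("a")
select()	Returns the elements that match a CSS selector (such as an id or a class); the return value is always a list, e.g. sp.select("#id") selects by id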

data1 = sp.select('title')
print(data1)
data2 = sp.select('#head')  # select the element with id="head"; an id needs the "#" prefix
#print(data2)
data3 = sp.select(".head_wrapper")  # select by the CSS class name head_wrapper; a class needs the "." prefix
#print(data3)
data4 = sp.find('a')  # the first <a> tag
#print(data4)
data5 = sp.find_all('a')  # all <a> tags
#print(data5)
# find_all(tag, {attribute name: attribute value}) returns all tags that match; the second argument is a dictionary
data5 = sp.find_all("a", {"class": "mnav"})
#print(data5[0])
# passing a list searches for several tags at once
data6 = sp.find_all(['a', 'title'])  # put the tag names in a list
#print(data6)
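
Since the live page can change at any time, the same calls are easier to follow on a small literal document. The HTML below is made up for illustration; it only borrows the id and class names used above, and the links are placeholders. The sketch assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the structure and links are placeholders
html_doc = '''
<html><head><title>Demo page</title></head>
<body>
<div id="head"><div class="head_wrapper">
<a class="mnav" href="http://example.com/news">News</a>
<a class="mnav" href="http://example.com/map">Map</a>
<a href="/about">About</a>
</div></div>
</body></html>
'''
sp = BeautifulSoup(html_doc, 'html.parser')

title_text = sp.title.text                    # text inside the <title> tag
by_id = sp.select('#head')                    # list of elements with id="head"
nested_links = sp.select('.head_wrapper a')   # <a> tags inside class head_wrapper
first_link = sp.find('a')                     # first <a> tag only
nav_names = [a.text for a in sp.find_all('a', {'class': 'mnav'})]

print(title_text)         # Demo page
print(len(by_id))         # 1
print(len(nested_links))  # 3
print(first_link.text)    # News
print(nav_names)          # ['News', 'Map']
```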

Reading attribute values: to read the value of a tag's attribute, use the get() method.
Syntax: get(attribute name)

import requests
from bs4 import BeautifulSoup
url = 'http://www.baidu.com/'
html = requests.get(url)
html.encoding = 'utf-8'
sp = BeautifulSoup(html.text, 'html.parser')
links = sp.find_all(['a', 'img'])  # fetch both <a> and <img> tags
for link in links:
    href = link.get("href")  # read the href attribute; None if the tag has none (<img> tags use src instead)
    # keep only values that are not None and start with http://
    if href is not None and href.startswith("http://"):
        print(href)
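
The startswith("http://") test skips https links and relative links such as /more. One possible refinement, sketched here with a made-up list of href values instead of a live page, is to resolve every link against the page URL with urljoin and then keep only http and https results.

```python
from urllib.parse import urljoin, urlparse

base = 'http://www.baidu.com/'
# Made-up href values of the kind link.get("href") can return
hrefs = ['http://example.com/a', 'https://example.com/b', '/more', 'javascript:;', None]

absolute = []
for href in hrefs:
    if href is None:
        continue  # the tag had no href attribute
    full = urljoin(base, href)  # resolve relative links against the page URL
    if urlparse(full).scheme in ('http', 'https'):
        absolute.append(full)

print(absolute)
# ['http://example.com/a', 'https://example.com/b', 'http://www.baidu.com/more']
```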
