Web Crawler Study Notes

Python 3 Web Crawler Tutorial

Learning resource: https://www.bilibili.com/video/av9784617?from=search&seid=3311514956616305524

BeautifulSoup

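BeautifulSoup is a library that parses HTML elements and builds a tag tree from them.

In HTML, each element (tag) consists of a tag name (tag.name), attributes (tag.attrs), and a content string (tag.string).

HTML document -> tag tree -> BeautifulSoup class

Basic elements of the BeautifulSoup class

Basic element	Description
Tag	A tag, the basic unit for organizing information; opened with <> and closed with </>
Name	The tag's name; for example, <p>…</p> has the name 'p'. Format: .name
Attributes	The tag's attributes, organized as a dictionary. Format: .attrs
NavigableString	The non-attribute string inside a tag, that is, the text between <> and </>. Format: .string
Comment	The comment part of the string inside a tag; a special Comment type (a subclass of NavigableString)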

import requests
from bs4 import BeautifulSoup
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.a) #the first <a> tag
print(soup.a.name) #name of the <a> tag
print(soup.a.parent.name)#name of the parent of <a>
print(soup.a.attrs)#attributes of the <a> tag
print(soup.a.attrs["class"])#value of the class attribute of <a>
print(type(soup.a.attrs))#type used to hold the tag attributes (dict)
print(type(soup.a))#type of a tag
print(soup.a.string)#string inside the tag
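
The same attribute accesses can be tried without network access. The following is a minimal sketch, assuming a trimmed, hand-typed copy of the demo page (the `DEMO_HTML` name is hypothetical) and Python's built-in `html.parser` instead of `lxml`:

```python
from bs4 import BeautifulSoup

# Trimmed local copy of the demo page, so no network access is needed.
DEMO_HTML = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

# html.parser ships with Python; "lxml" must be installed separately.
soup = BeautifulSoup(DEMO_HTML, "html.parser")

first_link = soup.a
print(first_link.name)            # tag name
print(first_link.parent.name)     # name of the enclosing tag
print(first_link.attrs["class"])  # class is multi-valued, so bs4 stores it as a list
print(first_link["href"])         # shorthand for first_link.attrs["href"]
print(first_link.string)          # text inside the tag
```

Note that `class` comes back as a list, because an HTML element may carry several classes.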

Content of http://python123.io/ws/demo.html:

<html>
	<head>
		<title>This is a python demo page</title>
	</head>
	<body>
		<p class="title">
			<b>The demo python introduces several python courses.</b>
		</p>
		<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
			<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> 
			and 
			<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.
		</p>
	</body>
</html>

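Basic tree structure of HTML tags

Downward traversal of the tag tree

Attribute	Description
.contents	List of child nodes; all children of the tag are stored in a list
.children	Iterator over child nodes; like .contents, used to loop over the children
.descendants	Iterator over descendant nodes; covers all descendants, used to loop over them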

import requests
from bs4 import BeautifulSoup
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.head.contents)#list of the child nodes of <head>
print(soup.body.contents)#list of the child nodes of <body>
print(len(soup.body.contents))#length of that list
print(soup.body.contents[1])#second element of the list


for child in soup.body.children:#iterate over the direct children
    print(child)

for desc in soup.body.descendants:#iterate over all descendants
    print(desc)
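
The difference between children and descendants is easier to see on a tiny document. This is a small sketch on a hand-written HTML string (an assumption, not the demo page), using the built-in `html.parser`:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div><p>one <b>bold</b></p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .children stops at the first level; .descendants walks the whole subtree.
child_names = [c.name for c in div.children if isinstance(c, Tag)]
descendant_names = [d.name for d in div.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
print(len(div.contents))
```

Both iterators also yield the text nodes between tags, which is why the sketch filters with `isinstance(..., Tag)`.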

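Upward traversal of the tag tree

Attribute	Description
.parent	The node's parent tag
.parents	Iterator over the node's ancestor tags, used to loop over the ancestors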

import requests
from bs4 import BeautifulSoup
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")

print(soup.title.parent)#parent tag of <title>
print(soup.html.parent)#parent of <html> (the BeautifulSoup object itself)

for parent in soup.a.parents:#walk through all ancestors of <a>
    if parent is None:
        print(parent)
    else:
        print(parent.name)
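
As a quick check of what the upward walk yields, here is a minimal sketch on a hand-written one-line document (an assumption for illustration). The last ancestor is the BeautifulSoup object itself, whose name is the string `[document]`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a href='#'>x</a></p></body></html>", "html.parser")

# Collect the names of all ancestors of the <a> tag, innermost first.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# The top of the tree: <html> hangs off the BeautifulSoup object, which has no parent.
print(soup.html.parent is soup)
print(soup.parent)
```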

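Sibling traversal of the tag tree

Attribute	Description
.next_sibling	The next sibling node in HTML text order
.previous_sibling	The previous sibling node in HTML text order
.next_siblings	Iterator over all following sibling nodes in HTML text order
.previous_siblings	Iterator over all preceding sibling nodes in HTML text order

Sibling traversal happens among nodes that share the same parent.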

import requests
from bs4 import BeautifulSoup
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")

print(soup.a.next_sibling)#next sibling node of <a>
print(soup.a.next_sibling.next_sibling)#the sibling after that
print(soup.a.previous_sibling)#previous sibling node of <a>

for sibling in soup.a.next_siblings:#iterate over the following siblings
    print(sibling)

for sibling in soup.a.previous_siblings:#iterate over the preceding siblings
    print(sibling)
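
One point worth seeing in practice: a sibling is not always a tag, since the text between two tags is a node too. A small sketch on a hand-written fragment (an assumption modeled on the demo page):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = '<p>intro <a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

next_node = first.next_sibling           # the text " and ", not the second <a>
next_tag = first.find_next_sibling("a")  # skips text nodes and returns the next <a>
following = list(first.next_siblings)    # [" and ", <a id="link2">, "."]

print(repr(next_node), next_tag["id"], len(following))
```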

HTML formatting and encoding with the bs4 library

import requests
from bs4 import BeautifulSoup
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")

print(soup.prettify())#pretty-printed output


Three kinds of information markup

HTML content search methods in the bs4 library

<>.find_all(name,attrs,recursive,string,**kwargs)

Returns a list that holds the search results.

name: a search string for tag names

import requests
from bs4 import BeautifulSoup
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")

print(soup.find_all("a"))#returns all <a> tags

print(soup.find_all(["a","b"]))#returns all <a> and <b> tags

attrs: a search string for tag attribute values; the attribute to search can be named explicitly

import requests
from bs4 import BeautifulSoup
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")

print(soup.find_all("p","course"))#returns <p> tags whose class is 'course'

print(soup.find_all(id='link1'))#finds tags whose id attribute is 'link1'

recursive: whether to search all descendants; defaults to True

import requests
from bs4 import BeautifulSoup
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")

print(soup.find_all('a', recursive=False)) #the top-level children of this document contain no <a> tag

string: a search string for the text between <> and </>

import requests
from bs4 import BeautifulSoup
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")

print(soup.find_all(string="Basic Python"))#exact match on the string
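
The four parameters can be compared side by side on a local fragment. This is a rough sketch on a hand-written HTML string (an assumption); it also uses the `class_` keyword and a compiled regular expression as the `string` filter, both of which bs4 accepts:

```python
import re
from bs4 import BeautifulSoup

html = ('<p class="course"><a class="py1" id="link1" href="/a">Basic Python</a>'
        '<a class="py2" id="link2" href="/b">Advanced Python</a></p>')
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")                        # name: every <a> tag
by_class = soup.find_all("a", class_="py1")       # attrs: filter on the class attribute
by_id = soup.find_all(id="link2")                 # attrs: filter on a named attribute
top_level = soup.find_all("a", recursive=False)   # recursive: direct children of soup only
texts = [str(s) for s in soup.find_all(string=re.compile("Python"))]  # string: regex match

print(len(links), by_class[0]["id"], by_id[0].string, top_level, texts)
```

With `recursive=False` the result is empty here because the only direct child of the document is the `<p>` tag.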

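Extension methods:

Method	Description
<>.find()	Searches and returns only one result (a single node, typically a Tag, or None if nothing matches); same parameters as .find_all()
<>.find_parents()	Searches the ancestors; returns a list; same parameters as .find_all()
<>.find_parent()	Returns one result from the ancestors (a Tag or None); same parameters as .find()
<>.find_next_siblings()	Searches the following siblings; returns a list; same parameters as .find_all()
<>.find_next_sibling()	Returns one result from the following siblings (a single node or None); same parameters as .find()
<>.find_previous_siblings()	Searches the preceding siblings; returns a list; same parameters as .find_all()
<>.find_previous_sibling()	Returns one result from the preceding siblings (a single node or None); same parameters as .find()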

Example: Chinese university rankings

import bs4
import requests
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Download a page and return its text, or "" on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # r.encoding = r.apparent_encoding
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        print("Failed to fetch the page text!")
        return ""


def fillUniList(html, num, uInfo):
    """Append [rank, name, score] for the first num table rows to uInfo."""
    s = BeautifulSoup(html, 'html.parser')
    tbody = s.find('tbody')
    if tbody is None:  # download failed or the page layout changed
        return
    count = 0
    for tr in tbody.children:
        if count >= num:
            break
        if isinstance(tr, bs4.element.Tag):  # skip the text nodes between rows
            tds = tr('td')  # shorthand for tr.find_all('td')
            uInfo.append([tds[0].string, tds[1].string, tds[3].string])
            count += 1


def printUniList(uInfo):
    tplt = "{0:^10}\t{1:^10}\t{2:^10}"
    print(tplt.format("Rank", "University", "Total score"))
    for u in uInfo:
        # str() guards against cells whose .string is None
        print(tplt.format(str(u[0]), str(u[1]), str(u[2])))


def main():
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    num = 20
    uInfo = []
    html = getHTMLText(url)
    fillUniList(html, num, uInfo)
    printUniList(uInfo)


if __name__ == '__main__':
    main()

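Regular expressions

Operator	Description	Example
.	Any single character (except a newline by default)
[]	Character set; gives the allowed values for a single character	[abc] matches a, b, or c; [a-z] matches a single character from a to z
[^ ]	Negated character set; gives the excluded values for a single character	[^abc] matches a single character other than a, b, or c
*	Zero or more repetitions of the preceding character	abc* matches ab, abc, abcc, and so on
+	One or more repetitions of the preceding character	abc+ matches abc, abcc, and so on
?	Zero or one repetition of the preceding character	abc? matches ab or abc
|	Either the left or the right expression	abc|def matches abc or def
{m}	Exactly m repetitions of the preceding character	ab{2}c matches abbc
{m,n}	m to n repetitions of the preceding character (n included)	ab{1,2}c matches abc or abbc
^	Matches the start of the string	^abc matches abc at the start of a string
$	Matches the end of the string	abc$ matches abc at the end of a string
()	Grouping; alternatives inside are separated with |	(abc) matches abc; (abc|def) matches abc or def
\d	A digit, equivalent to [0-9] for ASCII text
\w	A word character, equivalent to [A-Za-z0-9_] for ASCII text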

^[A-Za-z]+$ : a string made up only of the 26 letters

^-?\d+$ : a string in integer form

^[1-9]+[0-9]*$ : a string in positive-integer form
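
These patterns can be checked quickly with `re.search`. The sketch below assumes the letter pattern is `^[A-Za-z]+$` and the positive-integer pattern is `^[1-9]+[0-9]*$`; the `patterns` dictionary and the `matches` helper are hypothetical names introduced for illustration:

```python
import re

patterns = {
    "letters": r"^[A-Za-z]+$",
    "integer": r"^-?\d+$",
    "positive_integer": r"^[1-9]+[0-9]*$",
}

def matches(name, text):
    """Return True if text matches the pattern registered under name."""
    return re.search(patterns[name], text) is not None

for name, text in [("letters", "Python"), ("letters", "Py3"),
                   ("integer", "-42"), ("integer", "4.2"),
                   ("positive_integer", "100"), ("positive_integer", "012")]:
    print(name, repr(text), matches(name, text))
```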

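The re library

Main functions of the re library

Function	Description
re.search()	Searches a string for the first position where the regular expression matches; returns a match object
re.match()	Matches the regular expression starting at the beginning of a string; returns a match object
re.findall()	Searches a string and returns all matching substrings as a list
re.split()	Splits a string at the regular expression matches; returns a list
re.finditer()	Searches a string and returns an iterator over the matches; each element is a match object
re.sub()	Replaces all substrings that match the regular expression; returns the resulting string

Optional values for flags

Common flag	Description
re.I re.IGNORECASE	Ignores case in the regular expression; [A-Z] also matches lowercase characters
re.M re.MULTILINE	The ^ operator in the regular expression matches at the start of every line of the given string
re.S re.DOTALL	The . operator in the regular expression matches every character; by default it matches every character except a newline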
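
The effect of each flag shows up clearly on a two-line string. A minimal sketch (the sample text is made up for illustration):

```python
import re

text = "first line\nSecond line"

no_flag = re.findall(r"^\w+", text)           # ^ matches only at the start of the string
multiline = re.findall(r"^\w+", text, re.M)   # ^ also matches after each newline
dot_default = re.search(r"line.Second", text)       # . does not match the newline
dot_all = re.search(r"line.Second", text, re.S)     # . matches the newline too
ignore_case = re.findall(r"second", text, re.I)     # case-insensitive match

print(no_flag, multiline, dot_default, dot_all is not None, ignore_case)
```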

re.search()

re.search(pattern,string,flags=0)

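Searches a string for the first position where the regular expression matches and returns a match object, or None if there is no match.

pattern: the regular expression, as a string or raw string
string: the string to search
flags: control flags applied when the regular expression is used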

import re
match=re.search(r'[1-9]\d{5}','BIT 100081')
if match:
    print(match.group(0))

re.match()

re.match(pattern,string,flags=0)

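Matches the regular expression starting at the beginning of a string and returns a match object, or None if there is no match.

pattern: the regular expression, as a string or raw string
string: the string to match
flags: control flags applied when the regular expression is used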

import re
match=re.match(r'[1-9]\d{5}','100081 BIT')
if match:
    print(match.group(0))

re.findall()

re.findall(pattern,string,flags=0)

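Searches a string and returns all matching substrings as a list.

pattern: the regular expression, as a string or raw string
string: the string to search
flags: control flags applied when the regular expression is used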

import re
f_all= re.findall(r"[1-9]\d{5}","BT100081 SH132132 FJ132431432")
print(f_all)

re.split()

re.split(pattern,string,maxsplit=0,flags=0)

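Splits a string at the regular expression matches and returns a list.

pattern: the regular expression, as a string or raw string
string: the string to split
maxsplit: maximum number of splits; the remainder is returned as the last element
flags: control flags applied when the regular expression is used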

import re
print(re.split(r'[1-9]\d{5}','dad100081 fsv100084'))
print(re.split(r'[1-9]\d{5}','dad100081 fsv100084',maxsplit=1))

re.finditer()

re.finditer(pattern,string,flags=0)

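Searches a string and returns an iterator over the matches; each element is a match object.

pattern: the regular expression, as a string or raw string
string: the string to search
flags: control flags applied when the regular expression is used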

import re
for i in re.finditer(r'[1-9]\d{5}','dad100081 fsv100084'):
    if i:
        print(i.group(0))

re.sub()

re.sub(pattern,repl,string,count=0,flags=0)

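Replaces all substrings that match the regular expression and returns the resulting string.

pattern: the regular expression, as a string or raw string
repl: the string that replaces each match
string: the string to search
count: maximum number of replacements
flags: control flags applied when the regular expression is used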

import re
print(re.sub(r"[1-9]\d{5}",":world","BIT100081 SHS123123214345 SHDHKJ1231"))
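
For reference, here is what these functions return for the six-digit pattern used throughout this section. It is a small sketch; the `ZIP` name and the `:zipcode` replacement text are made up for illustration:

```python
import re

ZIP = r"[1-9]\d{5}"  # six digits, the first one non-zero

found = re.findall(ZIP, "BT100081 SH132132 FJ132431432")
parts = re.split(ZIP, "dad100081 fsv100084")
parts_once = re.split(ZIP, "dad100081 fsv100084", maxsplit=1)
replaced_once = re.sub(ZIP, ":zipcode", "BIT100081 TSU100084", count=1)
positions = [m.span() for m in re.finditer(ZIP, "dad100081 fsv100084")]

print(found)          # non-overlapping matches, left to right
print(parts)          # a trailing "" appears because the string ends with a match
print(parts_once)     # maxsplit=1 leaves the rest of the string untouched
print(replaced_once)  # count=1 replaces only the first match
print(positions)
```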

Another way to use the re library:

Functional style: one-off operations

import re
match=re.search(r'[1-9]\d{5}','BIT 100081')

Object-oriented style: compile once, then use many times

import re
regex=re.compile(r"[1-9]\d{5}")
match=regex.search("BIT 100081")

regex=re.compile()

regex=re.compile(pattern,flags=0)

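Compiles the string form of a regular expression into a regular expression object.

pattern: the regular expression, as a string or raw string
flags: control flags applied when the regular expression is used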
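
Compiling pays off when the same pattern is applied to many strings. A minimal sketch, with a made-up list of input lines and a hypothetical `ZIP_CODE` name:

```python
import re

# Compile once, then reuse the pattern object for every line.
ZIP_CODE = re.compile(r"[1-9]\d{5}")

lines = ["BIT 100081", "no code here", "TSU 100084"]
found = []
for line in lines:
    m = ZIP_CODE.search(line)  # same semantics as re.search(pattern, line)
    if m:
        found.append(m.group(0))

print(found)
```

The compiled object offers the same methods as the module-level functions (`search`, `match`, `findall`, `split`, `finditer`, `sub`), without the pattern argument.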

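The match object of the re library

Attributes of the match object

Attribute	Description
.string	The text being searched
.re	The pattern object (regular expression) used for the match
.pos	The position in the text where the search starts
.endpos	The position in the text where the search ends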

import re
match=re.search(r'[1-9]\d{5}','BIT 100081 HDJ23323213 JDK434382')
print(".string:",match.string)
print(".re:",match.re)
print(".pos:",match.pos)
print(".endpos:",match.endpos)

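Methods of the match object

Method	Description
.group()	Returns the matched string
.start()	Start position of the match in the original string
.end()	End position of the match in the original string
.span()	Returns (.start(), .end())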

import re
match=re.search(r'[1-9]\d{5}','BIT 100081')
print(match)
if match:
    print(match.group(0))
    print(match.start())
    print(match.end())
    print(match.span())

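Greedy and minimal matching in the re library

Greedy matching

By default the re library matches greedily, that is, it returns the longest matching substring.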

import re
match=re.search(r"PY.*N","PYANBNCNDN")
print(match.group(0))

Minimal matching

import re
match =re.search(r"PY.*?N","PYANBNCNDN")
print(match.group(0))

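Minimal matching operators

Operator	Description
*?	Zero or more repetitions of the preceding character, minimal match
+?	One or more repetitions of the preceding character, minimal match
??	Zero or one repetition of the preceding character, minimal match
{m,n}?	m to n repetitions of the preceding character (n included), minimal match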
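
The difference matters most when extracting delimited pieces such as HTML tags. A short sketch comparing both modes (the `<b>x</b>` sample is made up for illustration):

```python
import re

greedy = re.search(r"PY.*N", "PYANBNCNDN").group(0)    # runs to the last N
minimal = re.search(r"PY.*?N", "PYANBNCNDN").group(0)  # stops at the first N

tags_greedy = re.findall(r"<.+>", "<b>x</b>")    # one match spanning both tags
tags_minimal = re.findall(r"<.+?>", "<b>x</b>")  # one match per tag

print(greedy, minimal, tags_greedy, tags_minimal)
```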

Example: targeted crawl of Taobao product information

import re

import requests

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "cookie": "cookie copied from the page after logging in",
}


def getHTMLText(url):
    """Download a page and return its text, or "" on failure."""
    try:
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = "utf-8"
        return r.text
    except requests.RequestException:
        print("Network request failed!")
        return ""


def parsePage(ilt, html):
    """Extract [price, title] pairs from the page source into ilt."""
    try:
        plt = re.findall(r'"view_price":"[\d.]*"', html)
        tlt = re.findall(r'"raw_title":".*?"', html)
        for i in range(len(plt)):
            # Text after the first colon, with the surrounding quotes stripped
            price = plt[i].split(":", 1)[1].strip('"')
            title = tlt[i].split(":", 1)[1].strip('"')
            ilt.append([price, title])
    except IndexError:
        print("Failed to parse the page!")


def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))


def main():
    depth = 2  # number of result pages to fetch
    start_url = "https://s.taobao.com/list?spm=a217q.8031046.292818.2.3ab1789d9NSnuq&q=%E7%94%B7%E5%8C%85&cat=50072686&style=grid&seller_type=taobao&fs=1&auction_tag%5B%5D=12034"
    infoList = []
    for i in range(depth):
        url = start_url + "&s=" + str(44 * i)  # each page holds 44 items
        html = getHTMLText(url)
        parsePage(infoList, html)
    printGoodsList(infoList)


if __name__ == '__main__':
    main()
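
The extraction step can be tried offline. The sketch below is one possible variation rather than the method used above: it uses capture groups so that `findall` returns only the values, and it runs on a made-up `sample` string that imitates the `view_price` and `raw_title` fields:

```python
import re

# Hypothetical fragment imitating the data embedded in a search result page.
sample = ('{"raw_title":"Leather bag: brown","view_price":"129.00"},'
          '{"raw_title":"Canvas bag","view_price":"59.90"}')

# With one capture group, findall returns the group contents instead of the whole match.
prices = re.findall(r'"view_price":"([\d.]*)"', sample)
titles = re.findall(r'"raw_title":"(.*?)"', sample)
goods = list(zip(prices, titles))

for number, (price, title) in enumerate(goods, start=1):
    print("{:4}\t{:8}\t{:16}".format(number, price, title))
```

Capturing the value directly avoids splitting on the colon, so a title that itself contains a colon stays intact.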

Example: targeted stock crawler

import re

import requests
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Download a page and return its text, or "" on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = "utf-8"
        return r.text
    except requests.RequestException:
        return ""


def getStockList(lst, stockURL):
    """Collect stock codes such as sh600000 from every link on the list page."""
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, "lxml")
    for a in soup.find_all('a'):
        try:
            href = a.attrs["href"]
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except (KeyError, IndexError):  # no href, or no stock code in it
            continue


def getStockInfo(lst, stockURL, fpath):
    """Fetch each stock page and append its fields to fpath, one dict per line."""
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, "lxml")
            # attrs takes a dict that maps attribute names to values
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'Stock name': name.text.split()[0]})
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
            count = count + 1
            # end='' belongs to print(), so the progress line is overwritten in place
            print("\rProgress: {:.2f}%".format(count * 100 / len(lst)), end='')
        except Exception:
            count = count + 1
            print("\rProgress: {:.2f}%".format(count * 100 / len(lst)), end='')
            continue


def main():
    stock_list_url = "http://quote.eastmoney.com/stocklist.html"
    stock_info_url = "http://www.eastmoney.com/stock/"
    output_file = "./output_file.txt"
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)


if __name__ == '__main__':
    main()
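
The parsing step inside getStockInfo can be tested without the live site. This is a minimal sketch on a made-up `sample` fragment that only assumes the `stock-bets` and `bets-name` class names and the `dt`/`dd` layout used above; the company name and values are invented:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the block the crawler looks for.
sample = """<div class="stock-bets"><a class="bets-name">ExampleCo (sh600000)</a>
<dl><dt>Open</dt><dd>10.50</dd><dt>Volume</dt><dd>1.2M</dd></dl></div>"""

soup = BeautifulSoup(sample, "html.parser")
block = soup.find("div", attrs={"class": "stock-bets"})

info = {"name": block.find(attrs={"class": "bets-name"}).text.split()[0]}
# zip pairs each <dt> key with the <dd> value at the same position.
for dt, dd in zip(block.find_all("dt"), block.find_all("dd")):
    info[dt.text.strip()] = dd.text.strip()

print(info)
```

Using `zip` stops at the shorter list, so a page with a missing value does not raise an IndexError.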

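Scrapy crawler framework structure

Installation: pip install scrapy

Comparing the Requests library and the Scrapy framework

Requests VS Scrapy

Requests	Scrapy
Page-level crawler	Site-level crawler
Library	Framework
Little support for concurrency, lower performance	Good concurrency, higher performance
Focuses on downloading pages	Focuses on crawler structure
Flexible to customize	Flexible for ordinary customization, hard to customize deeply
Very easy to pick up	Slightly harder to get started

Common Scrapy commands

Command	Description	Format
startproject	Create a new project	scrapy startproject <name> [dir]
genspider	Create a spider	scrapy genspider [options] <name> <domain>
settings	Get the crawler configuration	scrapy settings [options]
crawl	Run a spider	scrapy crawl <spider>
list	List all spiders in the project	scrapy list
shell	Start the URL debugging shell	scrapy shell [url]

First example

Generating a spider

scrapy genspider demo python123.io

The generated demo.py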

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/']

    def parse(self, response):
        pass

parse() handles the response: it parses the content into dictionaries and discovers new URLs to request.

Configuring the generated spider:

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    # allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Save file %s.' % fname)

Run the demo spider:

scrapy crawl demo

Full version of the demo.py code:

# -*- coding: utf-8 -*-
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    # allowed_domains = ['python123.io']
    def start_requests(self):
        urls = ['http://python123.io/ws/demo.html']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Save file %s.' % fname)

The two versions are equivalent.

The yield keyword

Example:


#Generator version
def gen(n):
    for i in range(n):
        yield i**2

for i in gen(5):
    print(i," ",end="")

#Ordinary version
def square(n):
    ls=[i**2 for i in range(n)]
    return ls

for i in square(5):
    print(i," ",end="")

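A generator produces its values one at a time on demand, so it uses less memory than building a full list and can handle large data sets.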
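
The memory difference can be observed directly. A rough sketch using `sys.getsizeof`, which reports only the size of the container object itself (so the comparison is illustrative, not a full memory profile):

```python
import sys

def gen(n):
    # Yields one square at a time; nothing is stored.
    for i in range(n):
        yield i ** 2

def square(n):
    # Builds the complete list in memory before returning.
    return [i ** 2 for i in range(n)]

lazy = gen(100_000)
eager = square(100_000)

# The generator object stays small no matter how large n is.
print(sys.getsizeof(lazy), sys.getsizeof(eager))

# Values are computed only when requested.
first_three = [next(lazy) for _ in range(3)]
print(first_three)
```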

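Basic use of Scrapy

The Request class

class scrapy.http.Request()

Represents an HTTP request
Created by the Spider and executed by the Downloader

Request members

Attribute or method	Description
.url	The URL of the request
.method	The request method, such as "GET" or "POST"
.headers	The request headers, as a dictionary-like object
.body	The request body, as bytes
.meta	Extra information added by the user, used to pass information between Scrapy's internal modules
.copy()	Copies the request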

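The Response class

class scrapy.http.Response()

A Response object represents an HTTP response
Created by the Downloader and processed by the Spider

Response members

Attribute or method	Description
.url	The URL of the response
.status	The HTTP status code; defaults to 200
.headers	The headers of the response
.body	The content of the response, as bytes
.flags	A set of flags
.request	The Request object that produced this Response
.copy()	Copies the response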

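The Item class

class scrapy.item.Item()

An Item object represents information extracted from an HTML page
Created by the Spider and processed by the Item Pipeline
An Item is dictionary-like and can be used like a dictionary

How a Scrapy spider extracts information

The Scrapy framework supports several ways of extracting information from HTML:

BeautifulSoup
lxml
re
XPath Selector
CSS Selector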

Example: crawling stock data with Scrapy

Steps:

scrapy startproject BaiduStocks
cd BaiduStocks/
scrapy genspider stocks baidu.com
cd BaiduStocks/spiders

Writing stocks.py

# -*- coding: utf-8 -*-
import re

import scrapy


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    start_urls = ['http://quote.eastmoney.com/stocklist.html']

    def parse(self, response):
        # Follow every link whose href contains a stock code such as sh600000.
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = 'http://gupiao.baidu.com/stock/' + stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except IndexError:  # the href holds no stock code
                continue

    def parse_stock(self, response):
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            # Strip the leading '>' and the trailing '</dt>' from the key markup.
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except IndexError:  # no numeric value found
                val = '--'
            infoDict[key] = val

        infoDict.update({'Stock name': re.findall(r'\s.*\(', name)[0].split()[0] +
                         re.findall(r'\>.*\<', name)[0][1:-1]})
        yield infoDict

Writing pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


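class BaidustocksPipeline(object):
    def process_item(self, item, spider):
        return item


# Custom pipeline class
class BaidustocksInfoPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts.
        self.f = open('BaidustocksInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.f.close()

    def process_item(self, item, spider):
        # Write each item as one line; a failed write does not stop the crawl.
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except Exception:
            pass
        return item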

In settings.py, register the custom class in ITEM_PIPELINES so that Scrapy calls it, then save the file.
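
A sketch of what that setting could look like, assuming the project is named BaiduStocks as in the steps above and the class lives in the project's pipelines.py. The number 300 is an arbitrary priority chosen for illustration; Scrapy runs pipelines in ascending order of this value:

```python
# settings.py (fragment)
# Maps the import path of each pipeline class to its priority.
ITEM_PIPELINES = {
    "BaiduStocks.pipelines.BaidustocksInfoPipeline": 300,
}
```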

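Configuring concurrency options

Configuration file: settings.py

Options in settings.py

Option	Description
CONCURRENT_REQUESTS	Maximum number of concurrent requests performed by the Downloader; default 16
CONCURRENT_ITEMS	Maximum number of items processed concurrently in the Item Pipeline; default 100
CONCURRENT_REQUESTS_PER_DOMAIN	Maximum number of concurrent requests per target domain; default 8
CONCURRENT_REQUESTS_PER_IP	Maximum number of concurrent requests per target IP; default 0, takes effect only when non-zero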

