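Introduction to Targeted Web Crawling with Python

I. Basic Regular Expressions

Regular expressions are used to extract the pieces of content a crawler needs that follow a common pattern.

1. Regular Expression Symbols and Methods

Common symbols: dot, asterisk, question mark, and parentheses
Common methods: findall, search, sub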

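.: matches any single character except the newline \n
*: matches the preceding character zero or more times
?: matches the preceding character zero or one time
.*: greedy matching (consumes as much as it can)
.*?: non-greedy matching (like a baby eating small meals often: it takes as little as possible each time)
(): only the data inside the parentheses is returned as the result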

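findall: matches everything that fits the pattern and returns the results as a list
search: matches and extracts the first piece of content that fits the pattern, returning a match object
sub: replaces the content that fits the pattern and returns the resulting string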
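
The three methods can be compared side by side on one string. The sketch below is only an illustration: the sample text and the variable names are invented for this example.

```python
import re

text = "id=17; id=204; id=9"

# findall returns every captured group as a list of strings
all_ids = re.findall(r"id=(\d+)", text)

# search returns a match object for the first hit only (or None)
first = re.search(r"id=(\d+)", text)
first_id = first.group(1) if first else None

# sub returns a new string with each match replaced
masked = re.sub(r"\d+", "#", text)

print(all_ids)   # ['17', '204', '9']
print(first_id)  # 17
print(masked)    # id=#; id=#; id=#
```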

Python's regular expression library:
import re

Using the dot:
a = 'xy123'
b = re.findall('x.', a)
print(b)  # ['xy']

c = re.findall('x...', a)
print(c)  # ['xy12']
The dot is a placeholder: each dot stands for exactly one character.

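Using the asterisk:
a = 'xyxy123'
b = re.findall('x*', a)
print(b)  # ['x', '', 'x', '', '', '', '', '']
x* is tried at every position in the string; where there is no x it still matches, as an empty string.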

Using the question mark:
a = 'xy123'
b = re.findall('x?', a)
print(b)  # ['x', '', '', '', '', '']

Using .*:
search_code = 'hahfajxxixxfalflsjfslfjslfjxxlovexxljsljfsxxyouxxsjflsdjflsj'

a = re.findall('xx.*xx', search_code)
print(a)
# ['xxixxfalflsjfslfjslfjxxlovexxljsljfsxxyouxx']
.* matches as much as it can while still satisfying the pattern.

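Using .*?:
b = re.findall('xx.*?xx', search_code)
print(b)
# ['xxixx', 'xxlovexx', 'xxyouxx']
.*? eats little and often: it stops as soon as the pattern is satisfied, so it finds as many separate matches as possible.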

Using (.*?): the most important pattern
c = re.findall('xx(.*?)xx', search_code)
print(c)
# ['i', 'love', 'you']
for each in c:
    print(each, end=' ')
# i love you

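Adding a newline:
s = '''sdkfjxxhello
xxfslfjslxxworldxxafjf'''
d = re.findall('xx(.*?)xx', s)
print(d)
# ['fslfjsl']
Reason: . matches any character except the newline.

d = re.findall('xx(.*?)xx', s, re.S)
print(d)
# ['hello\n', 'world']
With re.S, . matches any character, including the newline.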

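Comparing findall and search:
s2 = 'asdfxxIxx123xxlovexxdfd'
f = re.search('xx(.*?)xx123xx(.*?)xx', s2).group(1)
print(f)
# I

f = re.search('xx(.*?)xx123xx(.*?)xx', s2).group(2)
print(f)
# love

f2 = re.findall('xx(.*?)xx123xx(.*?)xx', s2)
print(f2[0][1])
# love
The argument of group() is the number of the parenthesised group to return.

When the pattern contains more than one group, findall returns a list of tuples, so f2[0] is a tuple.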
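
To make the difference in return types concrete, here is a small sketch that runs both functions with the same pattern on the string s2 from above. The variable names match, pairs, and pattern are chosen for this illustration.

```python
import re

s2 = "asdfxxIxx123xxlovexxdfd"
pattern = r"xx(.*?)xx123xx(.*?)xx"

match = re.search(pattern, s2)     # a match object, or None if nothing matched
pairs = re.findall(pattern, s2)    # a list with one tuple per match

print(match.group(0))  # the whole matched text: xxIxx123xxlovexx
print(match.groups())  # ('I', 'love')
print(pairs)           # [('I', 'love')]
```

search is convenient when only the first occurrence matters; findall is the natural choice when every occurrence has to be collected.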

Using sub:
s = '123adkfjslks123'
output = re.sub('123(.*?)123', '123789', s)
output2 = re.sub('123(.*?)123', '123%d' % 12434, s)
print(output)
# 123789

output3 = re.sub('123(.*?)1', 'aaaa', s)
print(output3)
# aaaa23
Only the part of the string that matches the pattern is replaced; the rest is left unchanged.

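2. Common Regular Expression Techniques

Three ways to import the module:
import re
from re import *
from re import findall, search, sub, S  (with this form the re. prefix can be omitted)

The compile() method:
re.compile() compiles a pattern string into a pattern object that can be reused. It is not the built-in compile(), which compiles source code into a code object.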
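
As a minimal sketch of how a compiled pattern is reused, the example below compiles a digit pattern once and applies it to several strings. The sample lines are invented for illustration.

```python
import re

# Compile once, reuse many times
number = re.compile(r"\d+")

lines = ["order 66", "no digits here", "room 101, floor 3"]
found = [number.findall(line) for line in lines]
print(found)  # [['66'], [], ['101', '3']]
```

For a handful of patterns the module-level functions such as re.findall are usually enough, because the re module keeps a cache of recently compiled patterns.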

Matching digits (\d+):
a = 'dfsfjsl13233skdfjsldf'
b = re.findall(r'(\d+)', a)
print(b)
# ['13233']

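3. Applying Regular Expressions

(1) Use findall and search to pick out the content of interest from a large body of text
Grab the big block first, then the small pieces:
text_field = re.findall('<opening tag>(.*?)<closing tag>', html, re.S)[0]   # <opening tag> and <closing tag> are placeholders; the actual tags are missing here
the_text = re.findall(<pattern for the content to extract>, text_field)
The patterns have to be designed by inspecting the actual text of the page.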
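
The "big block first, small pieces second" idea can be sketched as follows. The HTML fragment, the class name courses, and both patterns are invented for this illustration; a real page needs patterns designed from its own source.

```python
import re

# Hypothetical page fragment, invented for illustration
html = """
<div class="nav"><a href="/home">Home</a></div>
<ul class="courses">
  <li><a href="/course/1">Regex basics</a></li>
  <li><a href="/course/2">Requests basics</a></li>
</ul>
"""

# Step 1 (big): isolate the block that holds the items we care about
block = re.findall(r'<ul class="courses">(.*?)</ul>', html, re.S)[0]

# Step 2 (small): pull the individual fields out of that block only
courses = re.findall(r'<a href="(.*?)">(.*?)</a>', block)
print(courses)  # [('/course/1', 'Regex basics'), ('/course/2', 'Requests basics')]
```

Narrowing the text first keeps the second pattern simple and stops it from picking up look-alike links elsewhere on the page, such as the navigation link here.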

(2) Use sub to implement page turning
for i in range(2, total_page + 1):
    new_link = re.sub(r'pageNum=\d+', 'pageNum=%d' % i, old_url)
    print(new_link)
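
A self-contained version of the page-turning loop might look like the sketch below. The URL and the page count are assumptions made up for the example. Note that the fourth positional parameter of re.sub is count, so flags, when needed, have to be passed by keyword (flags=...).

```python
import re

# Hypothetical listing URL, invented for illustration
old_url = "http://example.com/course/?pageNum=1&sort=new"
total_page = 4

links = []
for i in range(2, total_page + 1):
    # Swap only the page number; the rest of the URL is left alone
    links.append(re.sub(r"pageNum=\d+", "pageNum=%d" % i, old_url))

for link in links:
    print(link)
```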

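4. Python Crawler Practice

Target site: http://www.jikexueyuan.com/
Target content: course images
How it works:
1. Save the page source
2. Read the file in Python to load the source
3. Extract the image URLs with a regular expression
4. Download the images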
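
The four steps above can be sketched roughly as follows. This is not the code from the original exercise: the HTML snippet, the file name source.txt, the folder name pic, and the .jpg extension are all assumptions, and the download uses the standard library's urllib.request rather than any particular third-party package.

```python
import os
import re
import urllib.request

def extract_image_urls(html):
    """Return the src of every <img> tag found in the HTML text."""
    return re.findall(r'<img[^>]*?src="(.*?)"', html, re.S)

def download_images(urls, folder="pic"):
    """Save each image as pic/0.jpg, pic/1.jpg, ... (needs network access)."""
    os.makedirs(folder, exist_ok=True)
    for index, url in enumerate(urls):
        urllib.request.urlretrieve(url, os.path.join(folder, "%d.jpg" % index))

# Stand-in for the saved page source; in practice read it from the file, e.g.
#   with open("source.txt", encoding="utf-8") as f: html = f.read()
html = '''
<li><img src="http://example.com/a.jpg" class="lessonimg" alt="A"></li>
<li><img class="lessonimg" src="http://example.com/b.jpg" alt="B"></li>
'''
image_urls = extract_image_urls(html)
print(image_urls)
# download_images(image_urls)  # uncomment to actually download
```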

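II. Single-Threaded Crawlers in Python

1. Introducing and Installing requests

requests:
HTTP for Humans
requests is a widely used module for making HTTP requests. It is written in Python, makes fetching web pages straightforward, and is a good HTTP library to start with when learning to write crawlers.
Related link: https://blog.csdn.net/pittpakk/article/details/81218566
Advantages:
A full replacement for Python's urllib2 module (urllib.request in Python 3)
More automation
A friendlier user experience
More complete functionality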

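Installation:
Windows: pip install requests
Linux: sudo pip install requests
Anaconda: conda install requests

Tips for installing third-party libraries:
Avoid easy_install, because it can install packages but cannot uninstall them
Prefer pip
If the installation is blocked, download the package from:
https://www.lfd.uci.edu/~gohlke/pythonlibs/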

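Search the page for the Python package you need and download it.
The downloaded file has the extension .whl; rename it to .zip and unzip it.

The extracted requests folder is the part we need. Copy it into the Lib directory of the Python installation and the requests library is ready to use (the requests-2.6.0.dist-info folder is not needed). A .whl file can also be installed directly with pip install followed by the file name.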

2. Building a Web Crawler

Fetching page source with requests:
(1) Fetch the source directly
(2) Fetch the source with modified HTTP headers

import requests
html = requests.get('https://www.easyicon.net/')
print(html.text)

Fetching the source with modified HTTP headers (so the request looks like it comes from a browser):

import requests

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}
# how to obtain this value is described below

html = requests.get('https://www.easyicon.net/', headers=header)
html.encoding = 'utf-8'
print(html.text)

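Open the page you want to crawl, right-click and choose Inspect (Inspect Element), and open the Network tab. Refresh the page, pick any of the requests that appear, scroll down to Request Headers, and copy the User-Agent value.

requests and regular expressions:
Fetch the page source with requests, then use regular expressions to match the content of interest. That is the basic principle of a simple single-threaded crawler.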
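
Put together, a single-threaded crawler in this style is little more than a fetch step followed by a match step. The sketch below is one possible arrangement, not a fixed recipe: the helper names fetch_html and extract_title, the shortened User-Agent, the timeout, and the choice of the title tag as the "content of interest" are all assumptions made for illustration.

```python
import re

def fetch_html(url):
    """Download a page and return its text (needs the requests package)."""
    import requests  # imported here so the parsing helper works without it
    headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; copy a real one from the browser
    response = requests.get(url, headers=headers, timeout=10)
    response.encoding = "utf-8"
    return response.text

def extract_title(html):
    """Return the text inside <title>...</title>, or None if absent."""
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    return match.group(1).strip() if match else None

sample = "<html><head><title> Demo page </title></head><body></body></html>"
print(extract_title(sample))  # Demo page
# print(extract_title(fetch_html("https://www.easyicon.net/")))  # live request
```

Keeping the download and the parsing in separate functions makes it possible to try out the regular expression on saved HTML before sending any real requests.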

3. Submitting Data to a Web Page

GET and POST:
GET retrieves data from the server
POST sends data to the server
GET works by building parameters into the URL
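
To see what "building parameters into the URL" means, the sketch below assembles a GET URL by hand with the standard library. The endpoint and the parameter names q and pageNum are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = "http://example.com/search"       # hypothetical endpoint
params = {"q": "python", "pageNum": 2}   # hypothetical parameter names

# A GET request carries its arguments in the query string of the URL
url = base + "?" + urlencode(params)
print(url)  # http://example.com/search?q=python&pageNum=2

# The server side sees the same values after decoding the query string
print(parse_qs(urlsplit(url).query))  # {'q': ['python'], 'pageNum': ['2']}
```

With requests the same URL is built automatically when the dictionary is passed as requests.get(base, params=params). A POST request, by contrast, sends its data in the request body rather than in the URL.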

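Submitting a form with requests:
Core method: requests.post
Core steps: build the form, submit the form, read the response

Show more: asynchronous loading

Open Inspect, go to the Network tab, then click Show more on the page. In Headers, the Request Method of the new request is POST.
Scroll down to Form Data, where page is 2.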

The usual way to fetch page data:

import requests
import re
url = 'https://www.crowdfunder.com/browse/deals'
url2 = 'https://www.crowdfunder.com/browse/deals&template=false'  # also taken from the Network tab

html = requests.get(url).text
print(html)

The new method (for pages that load content asynchronously):
data = {
    'entities_only': 'true',
    'page': '1'   # change this to request another page
}
html_post = requests.post(url, data=data)
title = re.findall('"card-title">(.*?)</div>', html_post.text, re.S)
for each in title:
    print(each)

4. Practice: a Crawler

Topics involved:
Fetching pages with requests
Turning pages with re.sub
Matching content with regular expressions
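
One way the three pieces could fit together is sketched below. Everything specific here is an assumption: the URL, the pageNum parameter, the h2 markup, and the function names crawl and fake_fetch are invented for illustration. The download step is passed in as a function so that the sketch runs without network access; in a real crawler it could be something like lambda u: requests.get(u).text.

```python
import re

def crawl(first_url, total_page, fetch):
    """Collect titles from pages 1..total_page.

    fetch is any callable that maps a URL to HTML text.
    """
    titles = []
    for i in range(1, total_page + 1):
        url = re.sub(r"pageNum=\d+", "pageNum=%d" % i, first_url)  # turn the page
        html = fetch(url)                                           # get the page source
        titles.extend(re.findall(r'<h2 class="title">(.*?)</h2>', html, re.S))
    return titles

# Offline stand-in for the network so the sketch runs anywhere
def fake_fetch(url):
    page = re.search(r"pageNum=(\d+)", url).group(1)
    return '<h2 class="title">Course %s-a</h2><h2 class="title">Course %s-b</h2>' % (page, page)

result = crawl("http://example.com/course/?pageNum=1", 2, fake_fetch)
print(result)  # ['Course 1-a', 'Course 1-b', 'Course 2-a', 'Course 2-b']
```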
