Python Web Scraping: Regular Expressions


Common matching patterns
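
As a rough illustration of a few commonly used pattern elements (\d, \w, \b, + and {n}), the sketch below runs them against a made-up sample string. The sample text and variable names are assumptions chosen only for demonstration.

```python
import re

# Hypothetical sample text, chosen only to exercise each pattern element.
sample = "Order 42: 3 apples, 12 pears_ok"

digits = re.findall(r"\d+", sample)            # runs of one or more digits
two_digit = re.findall(r"\b\d{2}\b", sample)   # exactly two digits as a whole token
first_word = re.match(r"\w+", sample).group()  # \w matches letters, digits and underscore
joined = re.findall(r"\w+_\w+", sample)        # word characters around a literal underscore

print(digits)
print(two_digit)
print(first_word)
print(joined)
```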

1. re.match()

re.match() tries to match a pattern at the start of a string. If the pattern does not match at the start, match() returns None.
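
The start-of-string behaviour can be sketched as follows. The two patterns are picked only to contrast a match at index 0 with a word that appears later in the text.

```python
import re

text = "Hello 123 4567 World_This is a Regex Demo"

at_start = re.match(r"Hello", text)      # the pattern begins at index 0
not_at_start = re.match(r"World", text)  # "World" occurs later, so there is no match

print(at_start.span())
print(not_at_start)

# Check for None before calling group(), otherwise an AttributeError is raised.
if not_at_start is None:
    print("no match at the start of the string")
```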

Basic matching

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
print(len(content))
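# A raw string keeps backslashes such as \s and \d intact for the regex engine
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)
print(result)          # the match object
print(result.group())  # the matched text
print(result.span())   # (start, end) indices of the match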


Generic matching

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match('Hello.*Demo',content)
print(result)
print(result.group())


Capturing a target

Capture 123 4567:

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match(r'^Hello\s(\d+\s\d+)\sWorld.*Demo$', content)
print(result)
print(result.group(1))

Greedy matching

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))

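.* is greedy: it matches as many characters as possible, so here it leaves only the final digit for the group and group(1) is 7.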

Non-greedy matching

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match(r'^He.*?(\d+\s\d+).*Demo$', content)
print(result)
print(result.group(1))

.*? is non-greedy: it matches as few characters as possible, so the group captures 123 4567.
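
To see the two modes side by side, the sketch below applies a greedy and a non-greedy quantifier to the same string with the same capture group. The variable names greedy and lazy are illustrative only.

```python
import re

content = "Hello 123 4567 World_This is a Regex Demo"

# Greedy: .* consumes as much as it can and backs off only until \d+ can match.
greedy = re.match(r"^He.*(\d+).*Demo$", content)

# Non-greedy: .*? stops as early as possible, so \d+ starts at the first digit.
lazy = re.match(r"^He.*?(\d+).*Demo$", content)

print(greedy.group(1))
print(lazy.group(1))
```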

Match flags

import re

content = """Hello 123 4567 World_This
is a Regex Demo""" 
result = re.match(r'^He.*?(\d+\s\d+).*?Demo$', content)
print(result)

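By default, . does not match a newline, so the match fails and the result is None.

Specifying a flag (re.S lets . match newlines as well):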

import re

content = """Hello 123 4567 World_This
is a Regex Demo""" 
result = re.match(r'^He.*?(\d+\s\d+).*?Demo$', content, re.S)
print(result)
print(result.group(1))


Escaping

import re

content = "the price of shirt is $9.15"
result = re.match(r'the price of shirt is \$9\.15', content)
print(result)
print(result.group())
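
When the literal text is not known in advance, escaping by hand becomes error-prone. One possible approach, sketched here, is to let re.escape() from the standard library add the backslashes. The variable price is an assumption for illustration.

```python
import re

content = "the price of shirt is $9.15"
price = "$9.15"  # literal text that happens to contain regex metacharacters

# Without escaping, $ is treated as an end-of-string anchor and the match fails.
unescaped = re.match("the price of shirt is " + price, content)

# re.escape() backslash-escapes the metacharacters so they match literally.
escaped = re.match("the price of shirt is " + re.escape(price), content)

print(unescaped)
print(escaped.group())
```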

In short: prefer generic matching, use parentheses to capture the target, prefer non-greedy mode, and pass re.S when the text contains newlines.
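
These tips can be combined in one short sketch. The HTML snippet below is hypothetical and exists only to show generic matching, a capture group, non-greedy quantifiers and re.S working together.

```python
import re

# Hypothetical multi-line snippet, used only for illustration.
html = """<div class="book">
  <h2 class="title">
    Regex Basics
  </h2>
</div>"""

# Generic .*? skips what we do not care about, the parentheses capture the
# target, and re.S lets . cross the line breaks inside the snippet.
match = re.match(r'.*?<h2.*?>(.*?)</h2>', html, re.S)
title = match.group(1).strip() if match else None

print(title)
```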

2. re.search()

re.search() scans the whole string and returns the first successful match.

import re

content = "the price of shirt is $9.15"
result = re.search('price',content)
print(result)
print(result.group())
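
For comparison, the sketch below runs re.match() and re.search() with the same pattern on the same string. Only re.search() finds the word, because it does not appear at index 0.

```python
import re

content = "the price of shirt is $9.15"

matched = re.match(r"price", content)   # anchored at the start, so no match
found = re.search(r"price", content)    # scans forward until the first match

print(matched)
print(found.span())
```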


import re
content = """<meta name="description" content="騰訊網從2003年創立至今,已經成爲集新聞信息,區域垂直生活服務、社會化媒體資訊和產品爲一體的互聯網媒體平臺。騰訊網下設新聞、科技、財經、娛樂、體育、汽車、時尚等多個頻道,充分滿足用戶對不同類型資訊的需求。同時專注不同領域內容,打造精品欄目,並順應技術發展趨勢,推出網絡直播等創新形式,改變了用戶獲取資訊的方式和習慣。" />"""
result = re.search(r'<meta.*?content="(.*?)"\s/>', content)
print(result)
print(result.group(1))


3. re.findall()

re.findall() searches the string and returns all matching substrings as a list.

import re
content = """<ul class="nav-main fl" bossexpo="bg_dh_1">
    <li class="nav-item">
    <a href="http://news.qq.com/" target="_blank" bosszone="dh_1">新聞</a>
  </li>
    <li class="nav-item">
    <a href="http://v.qq.com/" target="_blank" bosszone="dh_2">視頻</a>
  </li>
    <li class="nav-item">
    <a href="http://new.qq.com/ch/photo/" target="_blank" bosszone="dh_3">圖片</a>
  </li>
    <li class="nav-item">
    <a href="https://new.qq.com/ch/milite/" target="_blank" bosszone="dh_4">軍事</a>
  </li>
    <li class="nav-item">
    <a href="https://sports.qq.com/" target="_blank" bosszone="dh_5">體育</a>
  </li>
    <li class="nav-item">
    <a href="http://sports.qq.com/nba/" target="_blank" bosszone="dh_6">NBA</a>
  </li>
    <li class="nav-item">
    <a href="https://new.qq.com/ch/ent/" target="_blank" bosszone="dh_7">娛樂</a>
  </li>
    <li class="nav-item">
    <a href="https://new.qq.com/ch/finance" target="_blank" bosszone="dh_8">財經</a>
  </li>
    <li class="nav-item">
    <a href="https://new.qq.com/ch/tech/" target="_blank" bosszone="dh_9">科技</a>
  </li>
    <li class="nav-item">
    <a href="https://new.qq.com/ch/fashion/" target="_blank" bosszone="dh_10">時尚</a>
  </li>
    <li class="nav-item">
    <a href="http://auto.qq.com/" target="_blank" bosszone="dh_11">汽車</a>
  </li>
    <li class="nav-item">
    <a href="http://house.qq.com/" target="_blank" bosszone="dh_12">房產</a>
  </li>
    <li class="nav-item">
    <a href="https://new.qq.com/ch/edu/" target="_blank" bosszone="dh_13">教育</a>
  </li>
    <li class="nav-item">
    <a href="https://new.qq.com/ch/cul/" target="_blank" bosszone="dh_14">文化</a>
  </li>
    <li class="nav-item">
    <a href="https://new.qq.com/ch/games/" target="_blank" bosszone="dh_15">遊戲</a>
  </li>
    <li class="nav-item">
    <a href="https://new.qq.com/ch/astro/" target="_blank" bosszone="dh_16">星座</a>
  </li>
  </ul><!--124ab1f2c59361a8f083289f63e618ba--><!--[if !IE]>|xGv00|c8ad5e7a2a8e8bd6a70240bd0844a132<![endif]-->
        <div class="nav-more fl">
  <div class="more-txt" bosszone="dh_more">更多</div>
  <div class="nav-sub" bossexpo="bg_dh_2">
    <ul class="sub-list cf">
            <li class="nav-item">
        <a href="https://new.qq.com/ch/ori/" target="_blank" bosszone="dh_1_2">獨家</a>
      </li>
            <li class="nav-item">
        <a href="https://v.qq.com/tv/" target="_blank" bosszone="dh_2_2">熱劇</a>
      </li>
            <li class="nav-item">
        <a href="https://new.qq.com/ch/antip/" target="_blank" bosszone="dh_3_2">抗肺炎</a>
      </li>
            <li class="nav-item">
        <a href="http://new.qq.com/ch/history/" target="_blank" bosszone="dh_4_2">歷史</a>
      </li>
            <li class="nav-item">
        <a href="http://sports.qq.com/premierleague/" target="_blank" bosszone="dh_5_2">英超</a>
      </li>
            <li class="nav-item">
        <a href="http://sports.qq.com/cba/" target="_blank" bosszone="dh_6_2">CBA</a>
      </li>
            <li class="nav-item">
        <a href="https://new.qq.com/ch2/star" target="_blank" bosszone="dh_7_2">明星</a>
      </li>
            <li class="nav-item">
        <a href="http://money.qq.com/" target="_blank" bosszone="dh_8_2">理財</a>
      </li>
            <li class="nav-item">
        <a href="https://new.qq.com/ch/5G/" target="_blank" bosszone="dh_9_2">5G</a>
      </li>
            <li class="nav-item">
        <a href="http://health.qq.com/" target="_blank" bosszone="dh_10_2">健康</a>
      </li>
            <li class="nav-item">
        <a href="http://auto.qq.com/" target="_blank" bosszone="dh_11_2">車型</a>
      </li>
            <li class="nav-item">
        <a href="http://www.jia360.com" target="_blank" bosszone="dh_12_2">家居</a>
      </li>
            <li class="nav-item">
        <a href="http://class.qq.com/" target="_blank" bosszone="dh_13_2">課程</a>
      </li>
            <li class="nav-item">
        <a href="http://dajia.qq.com/" target="_blank" bosszone="dh_14_2">大家</a>
      </li>
            <li class="nav-item">
        <a href="https://new.qq.com/ch/comic/" target="_blank" bosszone="dh_15_2">動漫</a>
      </li>
            <li class="nav-item">
        <a href="http://gongyi.qq.com/" target="_blank" bosszone="dh_16_2">公益</a>
      </li>
            <li class="nav-item">
        <a href="http://tianqi.qq.com/index.htm" target="_blank" bosszone="dh_17_2">天氣</a>
      </li>
            <li class="nav-item">
        <a href="https://new.qq.com/ch/politics/" target="_blank" bosszone="dh_18_2">政務</a>
      </li>
            <li class="nav-item">
        <a href="https://v.qq.com/channel/variety" target="_blank" bosszone="dh_19_2">綜藝</a>
      </li>
            <li class="nav-item">
        <a href="http://news.qq.com/photon/photoex.htm" target="_blank" bosszone="dh_20_2">影展</a>
      </li>
            <li class="nav-item">
        <a href="https://new.qq.com/ch/world/" target="_blank" bosszone="dh_21_2">國際</a>
      </li>
            <li class="nav-item">
        <a href="http://sports.qq.com/csocce/csl/" target="_blank" bosszone="dh_22_2">中超</a>
      </li>
            <li class="nav-item">
        <a href="http://fans.sports.qq.com/#/" target="_blank" bosszone="dh_23_2">社區</a>
      </li>
            <li class="nav-item">
        <a href="http://v.qq.com/movie/" target="_blank" bosszone="dh_24_2">電影</a>
      </li>
            <li class="nav-item">
        <a href="http://stock.qq.com/" target="_blank" bosszone="dh_25_2">證券</a>
      </li>
            <li class="nav-item">
        <a href="http://digi.tech.qq.com/" target="_blank" bosszone="dh_26_2">數碼</a>
      </li>
            <li class="nav-item">
        <a href="https://new.qq.com/ch/baby/" target="_blank" bosszone="dh_27_2">育兒</a>
      </li>
            <li class="nav-item">
        <a href="https://new.qq.com/ch/visit/" target="_blank" bosszone="dh_28_2">旅遊</a>
      </li>
            <li class="nav-item">
        <a href="https://new.qq.com/ch/life/" target="_blank" bosszone="dh_29_2">生活</a>
      </li>
            <li class="nav-item">
        <a href="http://kid.qq.com/" target="_blank" bosszone="dh_30_2">兒童</a>
      </li>
            <li class="nav-item">
        <a href="http://book.qq.com/" target="_blank" bosszone="dh_31_2">文學</a>
      </li>
            <li class="nav-item">
        <a href="https://new.qq.com/omv/" target="_blank" bosszone="dh_32_2">享看</a>
      </li>
            <li class="nav-item">
        <a href="https://new.qq.com/ch/cul_ru" target="_blank" bosszone="dh_33_2">新國風</a>
      </li>
            <li class="nav-item">
        <a href="http://www.qq.com/map/" target="_blank" bosszone="dh_34_2">全部</a>
      </li>
          </ul>

"""
result = re.findall(r'<a\shref.*?>(.*?)</a>', content, re.S)
print(result)
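
When a pattern has more than one capture group, re.findall() returns a list of tuples, one tuple per match. The sketch below uses a shortened two-item snippet modelled on the navigation markup above (with English link text substituted) to pull out both the URL and the label.

```python
import re

# Shortened sample modelled on the navigation list; the labels are illustrative.
html = '''<li><a href="http://news.qq.com/" target="_blank">News</a></li>
<li><a href="http://v.qq.com/" target="_blank">Video</a></li>'''

# Two groups: the href value and the link text.
pairs = re.findall(r'<a\shref="(.*?)".*?>(.*?)</a>', html, re.S)

for url, label in pairs:
    print(label, url)
```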


4. re.sub()

re.sub() replaces every matching substring and returns the resulting string.
Removing digits:

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.sub(r'\d+', '', content)
print(result)


import re

content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.sub(r'\d+', 'replace', content)
print(result)
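
Beyond a fixed replacement string, re.sub() also accepts backreferences, a callable and a count argument. The following is a minimal sketch of those three options; the variable names are illustrative.

```python
import re

content = "Hello 123 4567 World_This is a Regex Demo"

# \1 in the replacement refers to the text captured by the first group.
bracketed = re.sub(r"(\d+)", r"[\1]", content)

# A callable receives each match object and returns the replacement text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), content)

# count limits how many matches are replaced (here, only the first).
first_only = re.sub(r"\d+", "", content, count=1)

print(bracketed)
print(doubled)
print(first_only)
```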


5. re.compile()

re.compile() compiles a regular expression into a pattern object so that the pattern can be reused.

import re

content = """Hello 123 4567 
World_This is a Regex Demo
"""
pattern = re.compile('Hello.*Demo',re.S)
result = re.match(pattern,content)
print(result)
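
A compiled pattern object also exposes match(), search(), findall() and sub() as methods, which is convenient when the same pattern is applied to many strings. A minimal sketch, with made-up input lines:

```python
import re

# Compile once, then reuse the pattern object across several inputs.
number = re.compile(r"\d+")

lines = ["Hello 123 4567", "no digits here", "Demo 89"]  # illustrative data

found = [number.findall(line) for line in lines]
first = number.search(lines[0]).group()
masked = number.sub("#", lines[0])

print(found)
print(first)
print(masked)
```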

6. A small example

Scrape the 40 books in the "新書速遞" (New Books) section of the Douban Books home page (link, author, title).

import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}

r = requests.get('https://book.douban.com/',headers=headers)
content = r.text
print(r.status_code)
#print(content)
pattern = re.compile(r'<li.*?cover.*?href="(.*?)"\stitle="(.*?)".*?info.*?author">(.*?)</div>.*?more-meta.*?title">(.*?)</h4>.*?</li>', re.S)
result = re.findall(pattern,content)
for item in result:
    print(item[0])
    print(item[1])
    print(item[2].strip())
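
Because the live page can change or reject requests, it can help to try the pattern offline first. The sketch below applies the same pattern to a hand-written HTML fragment. The fragment, its example.com URLs, and the book and author names are all hypothetical; they only imitate the structure the pattern expects and are not taken from the real site.

```python
import re

# Hypothetical markup imitating the structure the pattern expects.
sample = """
<li class="">
  <div class="cover">
    <a href="https://example.com/book/1" title="Sample Book One">
      <img src="cover1.jpg">
    </a>
  </div>
  <div class="info">
    <div class="author">
      Author A
    </div>
    <div class="more-meta">
      <h4 class="title">
        Sample Book One
      </h4>
    </div>
  </div>
</li>
<li class="">
  <div class="cover">
    <a href="https://example.com/book/2" title="Sample Book Two">
      <img src="cover2.jpg">
    </a>
  </div>
  <div class="info">
    <div class="author">
      Author B
    </div>
    <div class="more-meta">
      <h4 class="title">
        Sample Book Two
      </h4>
    </div>
  </div>
</li>
"""

# Same pattern as above, split across two raw strings for readability.
pattern = re.compile(
    r'<li.*?cover.*?href="(.*?)"\stitle="(.*?)".*?info.*?author">(.*?)</div>'
    r'.*?more-meta.*?title">(.*?)</h4>.*?</li>',
    re.S,
)

# Each match is a 4-tuple; the fourth group repeats the title and is ignored.
books = [
    {"url": url, "title": title, "author": author.strip()}
    for url, title, author, _ in pattern.findall(sample)
]

for book in books:
    print(book)
```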
