Web Scraping Study Notes (7): Data Parsing with Regular Expressions, 2020.5.7

Original post

2020-05-08 03:48

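Preface

This section starts on data parsing, beginning with how to use regular expressions.

A reference table of regex syntax is easy to find online.
The main usages are as follows: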

1. Simple matching

import re
# Greedy mode (the default for *): matches as much as possible, up to the last n
one = 'mdfsdsfffdsn12345656n'
pattern = re.compile('m(.*)n')
result = pattern.findall(one)
print(result)
# Non-greedy mode: adding ? after * stops at the first n
pattern = re.compile('m(.*?)n')
result = pattern.findall(one)
print(result)

# Escaping: to match a literal backslash, escape it in the pattern and use raw strings
two = r"a\d"
pattern = re.compile(r'a\\d')
result = pattern.findall(two)
print(result)

# . matches any character except the newline \n
three = """
    msfdsdffdsdfsn
    1234567778888N
"""
pattern = re.compile('m(.*)n')
result = pattern.findall(three)
print(result)
# Two flags: re.S lets . match newlines as well, and re.I ignores case so the uppercase N also matches
pattern = re.compile('m(.*)n', re.S | re.I)
result = pattern.findall(three)
print(result)

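# Digits only: \d matches a single digit from 0 to 9
four = '234'
pattern = re.compile(r'^\d+$')
# Checking whether a string matches
# match: anchors at the start of the string and matches once; returns None on failure
result = pattern.match(four)
print(result.group())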

# Character classes and ranges: [123], [1-9]
five = '7893452'
pattern = re.compile('[1-9]')
result = pattern.findall(five)
print(result)
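
The snippets above print their results without showing them, so here is a minimal sketch of what greedy matching, non-greedy matching, and the re.S / re.I flags return. The shortened two-line `sample` string and the `is_all_digits` helper are assumptions made for illustration; they are not part of the original notes.

```python
import re

# Greedy vs. non-greedy on the string from the notes above.
one = 'mdfsdsfffdsn12345656n'
greedy = re.findall(r'm(.*)n', one)   # runs to the last 'n'
lazy = re.findall(r'm(.*?)n', one)    # stops at the first 'n'
print(greedy)  # ['dfsdsfffdsn12345656']
print(lazy)    # ['dfsdsfffds']

# Flags on a shortened two-line sample (an assumption, not the original string).
sample = 'msfdsn\n1234N'
no_flags = re.findall(r'm(.*)n', sample)                 # . stops at the newline
with_flags = re.findall(r'm(.*)n', sample, re.S | re.I)  # . crosses it, N counts as n
print(no_flags)    # ['sfds']
print(with_flags)  # ['sfdsn\n1234']


def is_all_digits(text):
    """Hypothetical helper: True if the whole string consists of digits."""
    # fullmatch returns None instead of raising, so no .group() call can fail here.
    return re.fullmatch(r'\d+', text) is not None


print(is_all_digits('234'))   # True
print(is_all_digits('23a4'))  # False
```

Checking the result against None before calling .group() avoids an AttributeError when the input does not match.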

2. Common methods

import re
one = 'abc 123'
pattern = re.compile(r'\d+')
# match: starts at the beginning of the string, matches once
result = pattern.match(one)
print(result)
# search: scans from any position, matches once
result = pattern.search(one)
print(result)
# findall: returns everything that matches the pattern, as a list
result = pattern.findall(one)
print(result)
# sub: replaces the matched text
result = pattern.sub('#', one)
print(result)
# split: splits the string on the pattern
pattern = re.compile(' ')
result = pattern.split(one)
print(result)

# Matching Chinese characters
two = '<a href="https://www.baidu.com/" nslog="normal" nslog-type="10600112" data-href="https://www.baidu.com/s?ie=utf-8&amp;fr=bks0000&amp;wd=">網頁是最新版本的,適配移動端</a>'
# In Python, match Chinese with a Unicode range; the quantifiers * + ? work as usual
pattern = re.compile('[\u4e00-\u9fa5]')
result = pattern.findall(two)
print(result)
pattern = re.compile('[\u4e00-\u9fa5]+')
result = pattern.findall(two)
print(result)
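
As a rough sketch of how the methods above differ, the following runs each of them on the same 'abc 123' string and records what comes back. The variable names are chosen for illustration only.

```python
import re

text = 'abc 123'
digits = re.compile(r'\d+')

# match only tries position 0, and 'a' is not a digit, so it returns None.
m = digits.match(text)
print(m)  # None

# search scans forward and returns a match object for the first hit.
s = digits.search(text)
first_number = s.group() if s else None
span = s.span() if s else None
print(first_number, span)  # 123 (4, 7)

# findall returns a list of every matching substring.
all_numbers = digits.findall(text)
print(all_numbers)  # ['123']

# sub replaces each match with the given text.
masked = digits.sub('#', text)
print(masked)  # abc #

# split cuts the string wherever the pattern matches.
words = re.split(r' ', text)
print(words)  # ['abc', '123']
```

Because match and search return None on failure, it is safer to test the result before calling .group() or .span().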

3. A simple example

import re
import requests
url = 'http://news.baidu.com/'
headers = {
    "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
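# response.text can be inaccurate because its encoding is guessed, so decode the raw bytes instead
data = requests.get(url, headers=headers).content.decode()
# Parse the data with a regex
# Title and URL of each news item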
pattern = re.compile('<a href="(.*?)" target="_blank" mon="(.*?)">(.*?)</a>')
# pattern = re.compile('<a (.*?)</a>',re.S)
result = pattern.findall(data)
print(result)
with open('02news.html', 'w', encoding='utf-8') as f:
    f.write(data)
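
findall with three capture groups returns a list of 3-tuples (href, mon, link text). The sketch below shows one way to turn those tuples into title/URL records, using the same pattern as above. To stay runnable offline it works on a small made-up HTML fragment; `SAMPLE_HTML`, the example.com URLs, and the `extract_links` helper are all assumptions for illustration, and the real page markup may differ.

```python
import re

# Same pattern as in the example above: href, mon attribute, link text.
LINK_PATTERN = re.compile(
    r'<a href="(.*?)" target="_blank" mon="(.*?)">(.*?)</a>'
)

# Made-up fragment standing in for the downloaded page (assumption).
SAMPLE_HTML = (
    '<li><a href="http://example.com/a" target="_blank" mon="ct=1&a=2">'
    'First headline</a></li>\n'
    '<li><a href="http://example.com/b" target="_blank" mon="ct=1&a=3">'
    'Second headline</a></li>\n'
)


def extract_links(html):
    """Hypothetical helper: return one {'url', 'title'} dict per matched link."""
    return [
        {'url': url, 'title': title}
        for url, _mon, title in LINK_PATTERN.findall(html)
    ]


links = extract_links(SAMPLE_HTML)
for link in links:
    print(link['title'], link['url'])
# First headline http://example.com/a
# Second headline http://example.com/b
```

A pattern this strict depends on the exact attribute order in the markup, so an empty result usually means the page layout changed, not that there are no links.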

Conclusion

Regular expressions are really handy.
