Handling backslashes in list elements when scraping parking-lot prices from a website

Original post

2020-04-30 20:49

If this helped you, please give it a like.

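First, some background. The list elements contained backslashes, so they came out garbled when printed, and I used a bytes replacement to strip the backslashes. The real goal was to use each list element as a dictionary key and look up its value. Inside the for loop, however, I could not chain the results together: only the last element ended up being added. So I went back through the string notes on my blog and found the join method, which concatenates the elements of a list.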
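To make the backslash issue concrete, here is a minimal sketch of what happens to one such element. The sample character is an assumption (a private-use code point chosen to match the key style used in the dictionary later in this post), and the final decode uses latin-1 rather than the utf-8 used in the post's code; for this input both give the same result.

```python
# Hypothetical sample: a private-use character standing in for a digit
glyph = "\uf2f6"

# raw_unicode_escape writes the character as a backslash followed by uf2f6
escaped = glyph.encode("raw_unicode_escape")

# Strip the backslash so the result can be used as a dictionary key
key = escaped.replace(b"\\", b"").decode("latin-1")

print(escaped)  # b'\\uf2f6'
print(key)      # uf2f6

# A plain ASCII character passes through unchanged
plain = "1".encode("raw_unicode_escape").replace(b"\\", b"").decode("latin-1")
print(plain)    # 1
```

The printed bytes literal shows two backslashes only because that is how Python displays a single backslash inside a bytes repr.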

The principle:

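chars = ['2','0','2','0','-','4','-','2','8']   # renamed from "list" to avoid shadowing the built-in
date = "".join(chars)   # concatenate the elements with an empty separator
print(date)             # prints 2020-4-28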

[Screenshot: output of the snippet above, 2020-4-28]
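For contrast with the join call above, the "only the last element" symptom typically appears when a variable is reassigned on every pass through the loop instead of being collected. The following is a small sketch with made-up sample data, not the scraper itself:

```python
digits = ["2", "0", "2", "0"]  # made-up sample data

# Pitfall: reassigning inside the loop keeps only the final element
last_only = ""
for d in digits:
    last_only = d

# Collect every element first, then join once after the loop
collected = []
for d in digits:
    collected.append(d)
joined = "".join(collected)

print(last_only)  # 0
print(joined)     # 2020
```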

Next, here is the same idea applied to scraping the prices shown on a web page:

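Original code: this Python script fetches the prices on one page. It builds a price dictionary, but the keys in that dictionary change often because the page gets updated, presumably as an anti-scraping measure. It uses XPath through lxml; for how to use XPath, see this blogger's post.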

# -*- coding: utf-8 -*-
import scrapy
import requests
import random
from bs4 import BeautifulSoup
from lxml import etree 
from fake_useragent import UserAgent
user_agent=UserAgent().random
headers={'User-Agent':user_agent}
url = 'http://www.dianping.com/beijing/ch65/g180'
res = requests.get(url,headers=headers)
resource = res.text
soup =BeautifulSoup(resource,'lxml')
book = etree.HTML(str(soup))
lis =  book.xpath("//div[@id='shop-all-list']//li")
price_dict = {
                "1": "1",
                "uf2f6": "2",
                "uf48e": "3",
                "ue0f0": "4",
                "uf1ba": "5",
                "uec71": "6",
                "ue208": "7",
                "ue0fb": "8",
                "uf5b0": "9",
                "uf0f3": "0",
            }
for li in lis:
    price = li.xpath(".//a[@class='mean-price']/b//text()")
    #price = str(price).encode('ISO-8859-1').decode('gbk')
    #print(len(price))
    #price_list = []
    if price:
        #print(price)
        if len(price) == 1:
            print(price)
        else:
            for i in range(1,len(price)):
                text=str(price[i].encode('raw_unicode_escape').replace(b'\\',b''),'utf-8')
                if text in price_dict.keys():
                    text = price_dict[text]
            newPrice = "".join(price_list)
            print(price[0]+newPrice)

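With this code, the joined price always came out incomplete, with only the last digit attached (as listed above, price_list is never created or appended to, so the join has nothing to collect). Changing it to the version below did not help either: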

# -*- coding: utf-8 -*-
import scrapy
import requests
import random
from bs4 import BeautifulSoup
from lxml import etree 
from fake_useragent import UserAgent
user_agent=UserAgent().random
headers={'User-Agent':user_agent}
url = 'http://www.dianping.com/beijing/ch65/g180'
res = requests.get(url,headers=headers)
resource = res.text
soup =BeautifulSoup(resource,'lxml')
book = etree.HTML(str(soup))
lis =  book.xpath("//div[@id='shop-all-list']//li")
price_dict = {
                "1": "1",
                "uf2f6": "2",
                "uf48e": "3",
                "ue0f0": "4",
                "uf1ba": "5",
                "uec71": "6",
                "ue208": "7",
                "ue0fb": "8",
                "uf5b0": "9",
                "uf0f3": "0",
            }
for li in lis:
    newPrice = ""
    price = li.xpath(".//a[@class='mean-price']/b//text()")
    #price = str(price).encode('ISO-8859-1').decode('gbk')
    #print(len(price))
    price_list = []
    if price:
        #print(price)
        if len(price) == 1:
            print(price)
        else:
            for i in range(1,len(price)):
                text=str(price[i].encode('raw_unicode_escape').replace(b'\\',b''),'utf-8')
                if text in price_dict.keys():
                    text = price_dict[text]
                    price_list.append(text)
            newPrice = newPrice.join(price_list)
            print(price[0]+newPrice)
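One thing worth noticing in the loop above is that any glyph missing from price_dict is skipped silently, so an out-of-date dictionary yields a shortened price rather than an error. The sketch below makes that visible by marking unknown glyphs with "?". The function name decode_price, the two small mappings, and the sample input are all hypothetical; only the encode/replace step mirrors the post's code.

```python
def decode_price(parts, mapping):
    """Translate obfuscated glyphs to digits; mark unknown glyphs with '?'."""
    out = []
    for ch in parts:
        # Same key derivation as in the post: escape, then drop the backslash
        key = ch.encode("raw_unicode_escape").replace(b"\\", b"").decode("latin-1")
        out.append(mapping.get(key, "?"))
    return "".join(out)

# Hypothetical mappings: one out of date, one matching the current glyphs
stale = {"1": "1", "uf2f6": "2"}
fresh = {"1": "1", "ue7cd": "2"}

parts = ["1", "\ue7cd"]  # made-up text nodes for a price of 12

print(decode_price(parts, stale))  # 1?
print(decode_price(parts, fresh))  # 12
```

A "?" in the output is a hint that the keys need refreshing, which is easier to spot than a price that is merely one digit short.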

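Working code (it also involves an encoding conversion). Note that the join logic is the same as in the previous attempt; the visible differences are the refreshed price_dict keys and printing price[0] instead of the whole list: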

# -*- coding: utf-8 -*-
import scrapy
import requests                          # HTTP download library
import random                            # random module
from bs4 import BeautifulSoup            # HTML parsing
from lxml import etree                   # lxml, used here for XPath queries
from fake_useragent import UserAgent     # generates User-Agent strings
user_agent = UserAgent().random
headers = {'User-Agent': user_agent}
url = 'http://www.dianping.com/beijing/ch65/g180'
res = requests.get(url, headers=headers)   # pass in the URL and the request headers
resource = res.text
soup = BeautifulSoup(resource, 'lxml')     # parse the page with the lxml parser
book = etree.HTML(str(soup))
lis = book.xpath("//div[@id='shop-all-list']//li")
price_dict = {
                "1": "1",
                "ue7cd": "2",
                "ueb07": "3",
                "uecfc": "4",
                "ue64f": "5",
                "ue314": "6",
                "uf701": "7",
                "uf839": "8",
                "uf2fb": "9",
                "ue9d4": "0",
            }
item = {}
for li in lis:
    #newPrice = ""
    # all text nodes under the price element of this listing
    price = li.xpath(".//a[@class='mean-price']/b//text()")
    #price = str(price).encode('ISO-8859-1').decode('gbk')
    #print(len(price))
    price_list = []   # decoded digits for this listing
    if price:
        if len(price) == 1:
            print(price[0])
        else:
            for i in range(1,len(price)):
                # escape the character, then drop the backslash to get a key such as 'ue7cd'
                text=str(price[i].encode('raw_unicode_escape').replace(b'\\',b''),'utf-8')
                if text in price_dict.keys():
                    text = price_dict[text]
                    price_list.append(text)
            # join the collected digits once, after the loop
            newPrice = "".join(price_list)
            print(price[0]+newPrice)

[Screenshot: output of running the script]
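As an aside, the same dictionary keys can be derived without the encode-and-strip round trip by formatting the code point directly. This is a swapped-in alternative, not the method used in the post; glyph_key and via_escape are hypothetical helper names, and the two routes only agree for ASCII characters and for characters between U+0100 and U+FFFF.

```python
def glyph_key(ch):
    """Return the lookup key for one character (hypothetical helper)."""
    # ASCII characters such as "1" are used as-is
    if ord(ch) < 128:
        return ch
    # Otherwise: 'u' followed by the code point as 4 lowercase hex digits
    return "u{:04x}".format(ord(ch))

def via_escape(ch):
    """The encode/replace route used in the post, for comparison."""
    return str(ch.encode("raw_unicode_escape").replace(b"\\", b""), "utf-8")

# Sample characters taken from the key style of price_dict
for ch in ["1", "\ue7cd", "\uf839"]:
    assert glyph_key(ch) == via_escape(ch)
    print(glyph_key(ch))
```

Run as written, this prints 1, ue7cd and uf839 on separate lines, matching the keys the post's code would look up.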
