How to handle backslashes in a scraped list when crawling a site's parking-lot prices

If this post helped you, please give it a like.

First, some background. The elements of the scraped list contain backslashes, so they came out garbled when printed. I stripped the backslashes with a string replacement, because the real goal is to use each list element as a key into a dictionary and look up its value. But when I tried to stitch the results together inside a for loop, I could never get a seamless concatenation; only the last element made it into the output. Looking back through the string-handling posts on my own blog, I found that the join method can be used to concatenate the elements of a list.
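
As a minimal sketch of that replacement (the glyph value here is only illustrative), encoding the character with raw_unicode_escape turns it into its \uXXXX escape sequence, and stripping the backslash leaves a plain string that can be used as a dictionary key:

glyph = '\uf2f6'    # the kind of element the scraped list contains
key = str(glyph.encode('raw_unicode_escape').replace(b'\\', b''), 'utf-8')
print(key)          # uf2f6 -- now usable as a dictionary key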

The principle:

date_list = ['2','0','2','0','-','4','-','2','8']
date = "".join(date_list)    # join every element with an empty separator
print(date)

Output:

2020-4-28
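
The separator is whatever string join is called on, so the larger date pieces could also be joined with "-" directly:

date = "-".join(['2020', '4', '28'])
print(date)   # 2020-4-28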

Next, let's look at how this principle applies when scraping the prices from a web page:

The original code: this is the Python code for grabbing the prices on a single page. It builds a price dictionary, but the keys in that dictionary change frequently because the site keeps updating the page, presumably as an anti-scraping measure. The code selects elements with XPath (through lxml); if you want to learn how to use XPath, take a look at that blogger's post on it.

# -*- coding: utf-8 -*-
import scrapy
import requests
import random
from bs4 import BeautifulSoup
from lxml import etree 
from fake_useragent import UserAgent
user_agent=UserAgent().random
headers={'User-Agent':user_agent}
url = 'http://www.dianping.com/beijing/ch65/g180'
res = requests.get(url,headers=headers)
resource = res.text
soup =BeautifulSoup(resource,'lxml')
book = etree.HTML(str(soup))
lis =  book.xpath("//div[@id='shop-all-list']//li")
price_dict = {
                "1": "1",
                "uf2f6": "2",
                "uf48e": "3",
                "ue0f0": "4",
                "uf1ba": "5",
                "uec71": "6",
                "ue208": "7",
                "ue0fb": "8",
                "uf5b0": "9",
                "uf0f3": "0",
            }
for li in lis:
    price = li.xpath(".//a[@class='mean-price']/b//text()")
    #price = str(price).encode('ISO-8859-1').decode('gbk')
    #print(len(price))
    #price_list = []
    if price:
        #print(price)
        if len(price) == 1:
            print(price)
        else:
            for i in range(1,len(price)):
                text=str(price[i].encode('raw_unicode_escape').replace(b'\\',b''),'utf-8')
                if text in price_dict.keys():
                    text = price_dict[text]
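            # note: the loop above only looks up and reassigns text; nothing is ever appended to price_list in this version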
            newPrice = "".join(price_list)
            print(price[0]+newPrice)
        

With code like this, I found that after the join only the last digit ever showed up, so the printed price was always incomplete, and even changing it to the version below did not fix it:

# -*- coding: utf-8 -*-
import scrapy
import requests
import random
from bs4 import BeautifulSoup
from lxml import etree 
from fake_useragent import UserAgent
user_agent=UserAgent().random
headers={'User-Agent':user_agent}
url = 'http://www.dianping.com/beijing/ch65/g180'
res = requests.get(url,headers=headers)
resource = res.text
soup =BeautifulSoup(resource,'lxml')
book = etree.HTML(str(soup))
lis =  book.xpath("//div[@id='shop-all-list']//li")
price_dict = {
                "1": "1",
                "uf2f6": "2",
                "uf48e": "3",
                "ue0f0": "4",
                "uf1ba": "5",
                "uec71": "6",
                "ue208": "7",
                "ue0fb": "8",
                "uf5b0": "9",
                "uf0f3": "0",
            }
for li in lis:
    newPrice = ""
    price = li.xpath(".//a[@class='mean-price']/b//text()")
    #price = str(price).encode('ISO-8859-1').decode('gbk')
    #print(len(price))
    price_list = []
    if price:
        #print(price)
        if len(price) == 1:
            print(price)
        else:
            for i in range(1,len(price)):
                text=str(price[i].encode('raw_unicode_escape').replace(b'\\',b''),'utf-8')
                if text in price_dict.keys():
                    text = price_dict[text]
                    price_list.append(text)
            newPrice = newPrice.join(price_list)
            print(price[0]+newPrice)
        

 

The correct code: it again involves the encoding conversion, and note that the keys of price_dict had to be updated as well, since the page's font mapping had changed in the meantime:

# -*- coding: utf-8 -*-
import scrapy
import requests   # requests, the download module
import random    # the random module
from bs4 import BeautifulSoup   # BeautifulSoup, the HTML parsing module
from lxml import etree # lxml, used here for XPath queries
from fake_useragent import UserAgent   # generates random User-Agent request headers
user_agent=UserAgent().random
headers={'User-Agent':user_agent}
url = 'http://www.dianping.com/beijing/ch65/g180'
res = requests.get(url,headers=headers)     # request the url with the headers
resource = res.text
soup =BeautifulSoup(resource,'lxml')   # parse the page text with the lxml parser
book = etree.HTML(str(soup))
lis =  book.xpath("//div[@id='shop-all-list']//li")
price_dict = {
                "1": "1",
                "ue7cd": "2",
                "ueb07": "3",
                "uecfc": "4",
                "ue64f": "5",
                "ue314": "6",
                "uf701": "7",
                "uf839": "8",
                "uf2fb": "9",
                "ue9d4": "0",
            }
item = {}
for li in lis:
    #newPrice = ""
    price = li.xpath(".//a[@class='mean-price']/b//text()")
    #price = str(price).encode('ISO-8859-1').decode('gbk')
    #print(len(price))
    price_list = []
    if price:
        if len(price) == 1:
            print(price[0])
        else:
            for i in range(1,len(price)):
                # turn each obfuscated glyph into its escape name, e.g. '\ue7cd' -> 'ue7cd'
                text=str(price[i].encode('raw_unicode_escape').replace(b'\\',b''),'utf-8')
                if text in price_dict.keys():
                    text = price_dict[text]
                    price_list.append(text)   # collect each decoded digit
            newPrice = "".join(price_list)   # join the collected digits into the price
            print(price[0]+newPrice)
        
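
The decode-and-join step in the correct code could also be pulled out into a small helper for reuse. This is just a sketch; decode_price is a name introduced here, not something from the original script:

def decode_price(parts, mapping):
    """Join the obfuscated glyph parts of one price into a readable string.
    parts[0] (e.g. the currency symbol) is kept as-is; the remaining glyphs
    are translated through mapping, collected in a list, and joined once."""
    digits = []
    for p in parts[1:]:
        key = str(p.encode('raw_unicode_escape').replace(b'\\', b''), 'utf-8')
        if key in mapping:
            digits.append(mapping[key])
    return parts[0] + "".join(digits)

# usage inside the for-loop above:
# if price:
#     print(decode_price(price, price_dict))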

Below is a screenshot of the run results:
