Handling illegal characters in aiohttp (UnicodeDecodeError: 'utf-8' codec can't decode bytes in position......)

Original

2020-06-24 22:32

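This problem cost me almost a full day. Calling text() kept raising "UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 24461-24462: invalid continuation byte", while calling read() returned raw bytes whose Chinese text came out garbled when parsed later. I searched through a lot of material online, but the real cause was my own oversight: even after reading the source I stayed fixated on the encoding argument and overlooked the other parameter that has a default value, errors.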
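Before choosing a fix, it can help to see which bytes the decoder rejects. The following is a minimal sketch, not part of the original script: describe_decode_error and the sample payload are hypothetical, and it relies only on the start, end, and reason attributes that UnicodeDecodeError carries.

```python
def describe_decode_error(raw: bytes, encoding: str = "utf-8", context: int = 8):
    """Return (start, end, reason, nearby_bytes) for the first bad sequence, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start / exc.end delimit the offending bytes inside raw
        lo = max(exc.start - context, 0)
        hi = min(exc.end + context, len(raw))
        return exc.start, exc.end, exc.reason, raw[lo:hi]
    return None


# Hypothetical payload: valid UTF-8 followed by a truncated multi-byte character.
sample = "中文".encode("utf-8") + b"\xe4\xb8" + b" tail"
info = describe_decode_error(sample)
print(info)
```

With a real response, the bytes returned by await resp.read() would take the place of sample.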

Here is the solution:

import aiohttp
import asyncio
from bs4 import BeautifulSoup  # cc() below uses BeautifulSoup with the lxml parser

headers = {
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch, br",
        "Accept-Language": "zh-CN,zh;q=0.8",
    }

async def ss():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://www.iteye.com/blogs/tag/java',headers=headers) as resp:
            print(resp.status)
            # "ignore" drops undecodable bytes; the default, "strict", raises an exception on them
            d = await resp.text("utf-8", "ignore")
            # d = await resp.read()
            # d = await resp.text()
            cc(d)

def cc(v):
    print(v)
    soup = BeautifulSoup(v, "lxml")
    contents = soup.select("div.content")
    for conten in contents:
        articleAuthor = conten.select("div.blog_info > a")
        if articleAuthor:
            print(articleAuthor)
            articleAuthor = articleAuthor[0]
        else:
            articleAuthor = ""
        print(articleAuthor)

# asyncio.run() creates the event loop, runs the coroutine, and closes the loop (Python 3.7+)
asyncio.run(ss())

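With this change the Chinese text displays correctly. Passing "utf-8" is not strictly required: when encoding is None, text() looks it up through self._get_encoding(), which resolved to utf-8 for this page (the error message names the 'utf-8' codec), so resp.text(errors="ignore") works as well.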

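With await resp.text(), the call fails immediately with the UnicodeDecodeError quoted above.

With await resp.read(), the Chinese text comes out garbled when parsed.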
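If the raw bytes from read() are preferred, one option is to decode them explicitly before handing them to the parser. This is a minimal sketch under assumptions: decode_body is a hypothetical helper, the body bytes stand in for a real response, and the page is assumed to be UTF-8 apart from a few stray bytes.

```python
def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode strictly when possible; otherwise mark bad bytes with U+FFFD."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Fall back only when strict decoding fails, so clean pages are untouched
        return raw.decode(encoding, errors="replace")


# Stand-in for `await resp.read()`: valid markup plus one stray byte.
body = "<p>中文</p>".encode("utf-8") + b"\xff"
text = decode_body(body)
print(text)
```

The resulting str can then be passed to cc() in place of the value returned by text().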

The source of text():

@asyncio.coroutine
def text(self, encoding=None, errors='strict'):
    """Read response payload and decode."""
    if self._content is None:
        yield from self.read()

    if encoding is None:
        encoding = self._get_encoding()

    return self._content.decode(encoding, errors=errors)

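The default value of errors is strict, which raises an exception when undecodable bytes are encountered.
If it is set to ignore, the undecodable bytes are skipped instead.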
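Since text() passes errors straight through to bytes.decode, the difference between the handlers can be seen without any network access. A small sketch with a made-up byte string follows; "replace" and "backslashreplace" are other standard Python handlers, shown only for comparison and not used in the original script.

```python
# Made-up payload: two valid characters with one invalid byte between them.
raw = "中".encode("utf-8") + b"\xff" + "文".encode("utf-8")

results = {}
for handler in ("strict", "ignore", "replace", "backslashreplace"):
    try:
        results[handler] = raw.decode("utf-8", errors=handler)
    except UnicodeDecodeError as exc:
        # Only "strict" ends up here
        results[handler] = f"raised: {exc.reason}"

for handler, value in results.items():
    print(f"{handler:>16}: {value!r}")
```

Note that ignore discards the bad bytes silently, so replace may be a better fit when it matters to know that something was lost.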
