Handling illegal characters in aiohttp (UnicodeDecodeError: 'utf-8' codec can't decode bytes in position......)

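This problem cost me nearly a full day. Calling text() kept raising "UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 24461-24462: invalid continuation byte", while calling read() to get the raw bytes left the Chinese text garbled when it was parsed later. I searched through a lot of material online, but the real cause was my own oversight: when I read the source code I fixated on the encoding and overlooked another parameter that has a default value.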

Here is the solution:

import aiohttp
import asyncio
from bs4 import BeautifulSoup  # used by cc() below; needs beautifulsoup4 and lxml installed

headers = {
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch, br",
        "Accept-Language": "zh-CN,zh;q=0.8",
    }

async def ss():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://www.iteye.com/blogs/tag/java',headers=headers) as resp:
            print(resp.status)
            # "ignore" drops illegal bytes; the default, "strict", raises an exception on them
            d = await resp.text("utf-8", "ignore")
            # d = await resp.read()
            # d = await resp.text()
            cc(d)

def cc(v):
    print(v)
    soup = BeautifulSoup(v, "lxml")
    contents = soup.select("div.content")
    for conten in contents:
        articleAuthor = conten.select("div.blog_info > a")
        if articleAuthor:
            print(articleAuthor)
            articleAuthor = articleAuthor[0]
        else:
            articleAuthor = ""
        print(articleAuthor)

# asyncio.run() (Python 3.7+) creates the event loop, runs ss() and closes the loop
asyncio.run(ss())

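With this change the Chinese text displays correctly. (Passing "utf-8" explicitly is not strictly required here: when no encoding is given, text() works out the encoding itself, and the error message shows it had already settled on utf-8 for this page.)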


With a plain await resp.text(), the call raises the error straight away.
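To see which bytes are actually at fault, the exception itself can be inspected. The following is a minimal sketch that needs no network access: the byte string is made up for illustration (a truncated multi-byte sequence followed by ASCII), not taken from the page above.

```python
# Made-up sample: valid UTF-8, then the first two bytes of a three-byte
# character, then plain ASCII. The ASCII byte breaks the sequence.
data = "中文".encode("utf-8") + b"\xe4\xb8" + b"abc"

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    start, end = exc.start, exc.end      # end is exclusive
    offending = exc.object[start:end]    # the bytes that could not be decoded
    reason = exc.reason
    # A little surrounding context helps locate the spot in the page.
    context = exc.object[max(0, start - 10):end + 10]
    print(start, end, offending, reason)
    print(context)
```

The "position 24461-24462" in the error message corresponds to start and end - 1, so the same attributes can be used on a real response body obtained with read().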


With await resp.read(), the Chinese text comes out garbled when the result is parsed.
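If the raw bytes from read() are preferred, one option is to decode them explicitly before handing them to the parser, rather than letting the parser guess the encoding. This is only a sketch: decode_body is a hypothetical helper name, the sample bytes are invented, and falling back to errors="replace" is one possible choice among several.

```python
def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode strictly first; on failure, substitute U+FFFD for bad bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # "replace" keeps a visible marker where bytes were undecodable,
        # whereas "ignore" would drop them silently.
        return raw.decode(encoding, errors="replace")


# Invented sample body with one stray byte between two valid fragments.
raw = "<p>中文</p>".encode("utf-8") + b"\xe4" + b"<p>ok</p>"
html = decode_body(raw)
```

In the script above this would be used as d = decode_body(await resp.read()), so that cc() receives a str instead of bytes.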


The source of text():

@asyncio.coroutine
def text(self, encoding=None, errors='strict'):
    """Read response payload and decode."""
    if self._content is None:
        yield from self.read()

    if encoding is None:
        encoding = self._get_encoding()

    return self._content.decode(encoding, errors=errors)

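The default value of errors is 'strict', which raises an exception when an illegal character is encountered.
Setting it to 'ignore' skips the illegal characters instead.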
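Since text() simply passes errors through to bytes.decode, the difference between the handlers can be tried out without aiohttp at all. A small sketch with an invented byte string; 'replace' is a third standard handler not used in the article, shown here only for comparison.

```python
# Invented sample: valid UTF-8 with one stray lead byte before "abc".
data = "中文".encode("utf-8") + b"\xe4" + b"abc"

try:
    data.decode("utf-8", errors="strict")   # same as the default
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc                       # strict: decoding aborts here

ignored = data.decode("utf-8", errors="ignore")    # "中文abc", bad byte dropped
replaced = data.decode("utf-8", errors="replace")  # "中文\ufffdabc", bad byte marked
```

Note that 'ignore' loses data silently, so it is worth checking that the dropped bytes are not part of content you care about.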




