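This problem cost me almost a full day. Calling text() kept raising "UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 24461-24462: invalid continuation byte". Calling read() instead returned the raw bytes, and the Chinese text came out garbled when I parsed it later. I searched online a lot, but in the end it was mostly my own oversight: I had read the source code, yet I kept fixating on the encoding and overlooked another parameter that has a default value.
Here is the solution: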
import asyncio

import aiohttp
from bs4 import BeautifulSoup  # missing from the original listing; the "lxml" parser must be installed as well

headers = {
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch, br",
    "Accept-Language": "zh-CN,zh;q=0.8",
}


async def ss():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://www.iteye.com/blogs/tag/java', headers=headers) as resp:
            print(resp.status)
            # errors="ignore" skips invalid bytes; the default, "strict",
            # raises an exception as soon as an invalid byte is found.
            d = await resp.text(encoding="utf-8", errors="ignore")
            # d = await resp.read()
            # d = await resp.text()
            cc(d)


def cc(v):
    print(v)
    soup = BeautifulSoup(v, "lxml")
    contents = soup.select("div.content")
    for conten in contents:
        # select() returns a list; take the first match if there is one.
        articleAuthor = conten.select("div.blog_info > a")
        if articleAuthor:
            print(articleAuthor)
            articleAuthor = articleAuthor[0]
        else:
            articleAuthor = ""
        print(articleAuthor)


# asyncio.run (Python 3.7+) replaces the manual event-loop setup.
asyncio.run(ss())
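With this change the Chinese text displays correctly. Passing "utf-8" explicitly is not strictly required here: when encoding is omitted, text() works out the encoding itself, and for this page that was already utf-8, as the error message shows.
With a plain await resp.text(), the call fails immediately with the UnicodeDecodeError quoted above.
With await resp.read(), the response comes back as bytes and the Chinese text is garbled at parsing time.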
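The positions quoted in the error message can be inspected directly, which helps confirm that only a few bytes are damaged and not the whole page. The following is a minimal sketch that uses a made-up payload, not the actual page; the helper name describe_decode_error is hypothetical.

```python
# Made-up payload: valid UTF-8 text followed by a multi-byte sequence
# that is cut short (0xE4 0xB8 starts a 3-byte character but the third
# byte is missing), then plain ASCII.
good = "中文".encode("utf-8")            # 6 bytes of valid UTF-8
broken = good + b"\xe4\xb8" + b"abc"


def describe_decode_error(raw: bytes, encoding: str = "utf-8") -> str:
    """Return a short report on the first undecodable span, or 'ok'."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start / exc.end delimit the offending bytes inside exc.object.
        bad = raw[exc.start:exc.end]
        return f"{exc.reason} at offset {exc.start}: {bad!r}"
    return "ok"


report = describe_decode_error(broken)
print(report)
print(describe_decode_error(good))
```

With a real response, the same helper could be applied to the bytes returned by await resp.read() to see exactly which bytes text() is choking on.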
The source of text():
@asyncio.coroutine
def text(self, encoding=None, errors='strict'):
    """Read response payload and decode."""
    if self._content is None:
        yield from self.read()
    if encoding is None:
        encoding = self._get_encoding()
    return self._content.decode(encoding, errors=errors)
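The errors parameter defaults to strict, which raises an exception as soon as an invalid byte sequence is encountered.
Setting it to ignore skips the invalid bytes instead. Note that the skipped bytes are silently dropped from the result.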
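Since text() simply forwards errors to bytes.decode(), the difference between the handlers can be tried without any network access. The sketch below uses a made-up byte string and a hypothetical helper decode_with; it also shows replace, another standard Python error handler, which keeps a visible marker where bytes were invalid.

```python
# Made-up payload: valid UTF-8, then a truncated multi-byte sequence,
# then more valid text.
raw = "中文".encode("utf-8") + b"\xe4\xb8" + " ok".encode("utf-8")


def decode_with(data: bytes, errors: str):
    """Decode as UTF-8 with the given error handler; None if it raises."""
    try:
        return data.decode("utf-8", errors=errors)
    except UnicodeDecodeError:
        return None


strict_result = decode_with(raw, "strict")   # the decode raises, so None
ignored = decode_with(raw, "ignore")         # invalid bytes are dropped
replaced = decode_with(raw, "replace")       # invalid bytes become U+FFFD

print(strict_result)
print(ignored)
print(replaced)
```

If losing bytes without a trace is a concern, await resp.text(errors="replace") might be a safer choice than ignore, because the replacement character shows where the damage was.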