Fixing garbled Chinese pages fetched with HtmlAgilityPack

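HtmlAgilityPack is an open-source HTML parser written in C#. Some parts of its design are less than polished, though: fetching a Chinese web page the standard way often yields garbled text. Take the Xinhuanet home page (http://xinhua.org) as an example. Following the HtmlAgilityPack samples, the fetching code looks like this: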

            HtmlWeb hw = new HtmlWeb();
            string url = @"http://xinhua.org";
            HtmlDocument doc = hw.Load(url);
            doc.Save("output.html");

Opened in IE, the saved page is garbled.

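After working through the maze of HtmlAgilityPack source code, I finally traced the problem to the Get(Uri uri, string method, string path, HtmlDocument doc) method of the HtmlWeb class. The method contains the following code: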

            HttpWebResponse resp;
            try
            {
                resp = req.GetResponse() as HttpWebResponse;
            }
            ……
            if ((resp.ContentEncoding != null) && (resp.ContentEncoding.Length > 0))
            {
                respenc = System.Text.Encoding.GetEncoding(resp.ContentEncoding);
            }
            else
            {
                respenc = null;
            }
            ……
            Stream s = resp.GetResponseStream();
            if (s != null)
            {
                if (UsingCache)
                {
                    // NOTE: LastModified does not contain milliseconds, so we remove them to the file
                    SaveStream(s, cachePath, RemoveMilliseconds(resp.LastModified), _streamBufferSize);

                    // save headers
                    SaveCacheHeaders(req.RequestUri, resp);

                    if (path != null)
                    {
                        // copy and touch the file
                        IOLibrary.CopyAlways(cachePath, path);
                        File.SetLastWriteTime(path, File.GetLastWriteTime(cachePath));
                    }
                }
                else
                {
                    // try to work in-memory
                    if ((doc != null) && (html))
                    {
                        if (respenc != null)
                        {
                            doc.Load(s, respenc);
                        }
                        else
                        {
                            doc.Load(s, true);
                        }
                    }
                }
            }
            resp.Close();
            }

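Here resp is the response to the HTTP request. Setting a breakpoint shows that resp.ContentEncoding is empty, so respenc stays null and the load falls through to doc.Load(s, true). That Load overload may have a problem of its own, because the end result is garbled text. Note that ContentEncoding carries the HTTP Content-Encoding header (a compression scheme such as gzip), not the page's character set; HttpWebResponse exposes the character set through its CharacterSet property.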
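To make the failure mode concrete, here is a small sketch of what happens when bytes are decoded with the wrong encoding. It is only an illustration: the class and method names are made up for this post, and UTF-8 and ISO-8859-1 stand in for the encodings involved, since gb2312 is not necessarily available on every runtime without extra setup.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of HtmlAgilityPack.
static class MojibakeDemo
{
    // Encode `text` with the encoding the server really used, then decode the
    // bytes with the encoding the client assumed.
    public static string Roundtrip(string text, Encoding actual, Encoding assumed)
    {
        byte[] raw = actual.GetBytes(text);
        return assumed.GetString(raw);
    }
}
```

Calling MojibakeDemo.Roundtrip("\u4E2D\u6587", Encoding.UTF8, Encoding.UTF8) gives the two Chinese characters back unchanged, while passing Encoding.GetEncoding("iso-8859-1") as the assumed encoding turns the same bytes into six unrelated Latin-1 characters. That is the kind of garbage that ends up in output.html.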

Solution:

Do not use HtmlWeb; the class is immature. Issue the HTTP request yourself instead, with code like this:

            string url = @"http://xinhua.org";
            try
            {
                HttpWebRequest req = (HttpWebRequest)WebRequest.Create(new Uri(url));
                req.Method = "GET";
                // Dispose the response and its stream once the document is loaded.
                using (WebResponse rs = req.GetResponse())
                using (Stream rss = rs.GetResponseStream())
                {
                    HtmlDocument doc = new HtmlDocument();
                    doc.Load(rss);
                    doc.Save("output.html");
                }
            }
            catch (Exception e)
            {
                Console.WriteLine(e.Message);
                Console.WriteLine(e.StackTrace);
            }

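In the code above, doc.Load(…) decodes the stream with System.Text.Encoding.Default, which is gb2312 on my machine, so this works only when the page uses the same encoding as the system default.
HtmlDocument can also load a stream with an explicitly specified encoding. There are two ways to obtain that encoding:
(1) the HttpWebResponse object exposes the charset declared for the page (its CharacterSet property);
(2) for HTML pages that declare no charset, HtmlDocument provides an encoding auto-detection method, DetectEncoding(…). I have not tested this method, so I cannot vouch for its accuracy.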
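The two options could be combined roughly as follows. This is a sketch under a few assumptions, not a tested recipe. EncodingAwareFetch and ResolveEncoding are names invented for this post, the UTF-8 last resort is my own choice, and the DetectEncoding fallback remains unverified.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
static class EncodingAwareFetch
{
    // Map a charset name such as "gb2312" to an Encoding. Returns `fallback`
    // when the name is empty or unknown to this runtime.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset)) return fallback;
        try { return Encoding.GetEncoding(charset.Trim().Trim('"', '\'')); }
        catch (ArgumentException) { return fallback; }
    }

    public static HtmlDocument Load(string url)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(new Uri(url));
        using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
        using (Stream body = resp.GetResponseStream())
        using (MemoryStream buffer = new MemoryStream())
        {
            // Buffer the body so the same bytes can be read more than once.
            body.CopyTo(buffer);

            // (1) Charset reported by the response.
            Encoding enc = ResolveEncoding(resp.CharacterSet, null);

            // (2) Otherwise ask HtmlAgilityPack to detect it (untested).
            if (enc == null)
            {
                buffer.Position = 0;
                enc = new HtmlDocument().DetectEncoding(buffer);
            }

            // Assumption: fall back to UTF-8 when neither step gives an answer.
            buffer.Position = 0;
            HtmlDocument doc = new HtmlDocument();
            doc.Load(buffer, enc ?? Encoding.UTF8);
            return doc;
        }
    }
}
```

Usage would be EncodingAwareFetch.Load(@"http://xinhua.org").Save("output.html"). One caveat: as far as I know, some runtimes report a default such as ISO-8859-1 in CharacterSet when the server names no charset, so a value like that may deserve a second look before it is trusted.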

Excerpted from: http://community.icburner.com/blogs/vs2010tests/archive/2009/07/09/better-html-parsing-and-validation-with-htmlagilitypack.aspx