Extracting Information from Web Pages with Regular Expressions in C#

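Hello everyone. Today I would like to share how to use regular expressions in ASP.NET to extract information from HTML. As we know, web pages often contain useful information such as the page title, text, images, links, and tables. Search engine engineers, for example, often need this kind of information when they look for keywords or images in a page.

This article shows how to get that information quickly with regular expressions in .NET. Start by creating an empty web application in VS2010.

First, create a source page, that is, the page the information will be extracted from. In this sample it contains text, scripts, images, and links.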

[Download the complete source code for this sample (0 points)] http://download.csdn.net/source/3450356

Every page in this project sets the AutoEventWireup attribute in its page directive:

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="SourcePage.aspx.cs" Inherits="CSASPNETStripHtmlCode.SourcePages" %>

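When AutoEventWireup is set to true, the page framework automatically calls the page's event handlers, such as Page_Load. In this sample, if it is not set this way, the second call to the method that retrieves the HTML code fails.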

SourcePage.aspx

<html xmlns="http://www.w3.org/1999/xhtml">
<head id="Head1" runat="server">
    <title></title>
</head>
<script type="text/javascript"></script>
<script type="text/javascript"></script>
<body>
    <form id="form1" runat="server">
    <div>
        Hello everybody:<br />
        <a href="http://www.microsoft.com" type="text/html">www.microsoft.com</a><br />
        <a href="http://www.asp.net">www.asp.net</a><br />
        <input type="text" id="textDisplay" runat="server" /><asp:Button id="Button1" runat="server" Text="Submit" OnClientClick="return click_client()" />
        <input id="Checkbox1" type="checkbox" value="Check" /><br />
    </div>
    <img alt="Image/asp.jpg" src="Image/asp.jpg" />
    <img alt="Image/asp.jpg" src="Image/asp.jpg" width="100"/>
    </form>
</body>
</html>

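Next, add a Default.aspx page. This page requests SourcePage.aspx and extracts the required information from it. Its markup contains a multi-line TextBox and several Buttons; each Button retrieves one kind of information from the source page and places it in the TextBox. As before, the page directive sets the AutoEventWireup attribute: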

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="Default.aspx.cs" Inherits="CSASPNETStripHtmlCode.Defaults" %>

Default.aspx (HTML):

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
        <a href="SourcePage.aspx">View the SourcePage.aspx</a><br />
        <asp:TextBox ID="tbResult" runat="server" Height="416px" Width="534px" TextMode="MultiLine"></asp:TextBox>
        <br />
        <asp:Button ID="btnRetrieveAll" runat="server" Text="Retrieve entire Html" onclick="btnRetrieveAll_Click" />
        <asp:Button ID="btnRetrievePureText" runat="server" Text="Retrieve pure text" onclick="btnRetrievePureText_Click" />
        <asp:Button ID="btnRetrieveSriptCode" runat="server" Text="Retrieve script code" onclick="btnRetrieveSriptCode_Click" />
        <asp:Button ID="btnRetrieveImage" runat="server" Text="Retrieve images" onclick="btnRetrieveImage_Click" />
        <asp:Button ID="btnRetrievelink" runat="server" Text="Retrieve links" onclick="btnRetrievelink_Click" />
    </div>
    </form>
</body>
</html>

The last step is to write the methods that extract parts of the HTML code with regular expressions.

First we need the HTML code of the whole page. The HttpWebRequest and HttpWebResponse classes request the source page, and a StreamReader reads the response and returns it as a string.
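Taken on its own, the fetch step can be sketched as a small console program. This is only an illustration under stated assumptions: the names FetchSketch, ReadAllText, and DownloadHtml are hypothetical and do not come from the sample, and the 30-second timeout is an arbitrary choice. The reading step is split out so it can be tried without a network connection.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class FetchSketch
{
    // Reads a whole stream as text. Split out so it can be exercised without a network.
    public static string ReadAllText(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical helper: requests a page and returns its HTML as a string.
    // GetResponse() throws a WebException for error status codes and timeouts.
    public static string DownloadHtml(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 30000; // milliseconds; an arbitrary value for this sketch

        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            return ReadAllText(body, Encoding.UTF8);
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            // No URL given: show the reading step on an in-memory stream.
            var sample = new MemoryStream(Encoding.UTF8.GetBytes("<html><body>Hello</body></html>"));
            Console.WriteLine(ReadAllText(sample, Encoding.UTF8));
            return;
        }

        try
        {
            Console.WriteLine(DownloadHtml(args[0]));
        }
        catch (WebException ex)
        {
            Console.WriteLine("Request failed: " + ex.Message);
        }
    }
}
```

The using blocks close the reader and the response even when an exception is thrown, which is the same job the try/finally pattern does by hand. On recent .NET versions, HttpClient is the recommended replacement for HttpWebRequest.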

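Then we can parse the HTML code and extract parts of it. In this sample, btnRetrievePureText gets the plain text, btnRetrieveSriptCode gets the script code (rarely needed), btnRetrieveImage gets the images, and btnRetrievelink gets the links. You can of course change the regular expressions and methods to get whatever other information you want.

The complete code follows.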

Default.aspx.cs

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace CSASPNETStripHtmlCode
{
    // The namespace and class name must match the Inherits attribute of the
    // page directive in Default.aspx.
    public partial class Defaults : System.Web.UI.Page
    {
        string strUrl = string.Empty;
        string strWholeHtml = string.Empty;
        const string MsgPageRetrieveFailed = "Sorry, the web page is not run successful";
        bool flgPageRetrieved = true;

        protected void Page_Load(object sender, EventArgs e)
        {
            // Build the URL of SourcePage.aspx from the URL of this page.
            strUrl = this.Page.Request.Url.ToString().Replace("Default", "SourcePage");
            tbResult.Text = string.Empty;
        }

        protected void btnRetrieveAll_Click(object sender, EventArgs e)
        {
            strWholeHtml = this.GetWholeHtmlCode(strUrl);
            tbResult.Text = flgPageRetrieved ? strWholeHtml : MsgPageRetrieveFailed;
        }

        /// <summary>
        /// Retrieve the entire HTML code of a page with WebRequest and
        /// WebResponse, reading the response as UTF-8.
        /// </summary>
        /// <param name="url">Address of the page to download.</param>
        /// <returns>The HTML code, or an error message on failure.</returns>
        public string GetWholeHtmlCode(string url)
        {
            string strHtml = string.Empty;
            StreamReader strReader = null;
            HttpWebResponse wrpContent = null;
            try
            {
                // Use the url parameter rather than the strUrl field.
                HttpWebRequest wrqContent = (HttpWebRequest)WebRequest.Create(url);
                wrqContent.Timeout = 300000; // milliseconds
                wrpContent = (HttpWebResponse)wrqContent.GetResponse();
                if (wrpContent.StatusCode != HttpStatusCode.OK)
                {
                    flgPageRetrieved = false;
                    strHtml = MsgPageRetrieveFailed;
                }
                else
                {
                    // Read the body only when the request succeeded, so the
                    // error message above is not overwritten.
                    strReader = new StreamReader(wrpContent.GetResponseStream(), Encoding.GetEncoding("utf-8"));
                    strHtml = strReader.ReadToEnd();
                }
            }
            catch (Exception e)
            {
                flgPageRetrieved = false;
                strHtml = e.Message;
            }
            finally
            {
                if (strReader != null) strReader.Close();
                if (wrpContent != null) wrpContent.Close();
            }
            return strHtml;
        }

        /// <summary>
        /// Retrieve the pure text from the HTML code. Only the content of
        /// the body element is included.
        /// </summary>
        protected void btnRetrievePureText_Click(object sender, EventArgs e)
        {
            strWholeHtml = this.GetWholeHtmlCode(strUrl);
            if (flgPageRetrieved)
            {
                // (\w|\W)*? lazily matches any characters, including line breaks.
                string strRegexScript = @"(?m)<body[^>]*>(\w|\W)*?</body[^>]*>";
                string strRegex = @"<[^>]*>";
                Match matchText = Regex.Match(strWholeHtml, strRegexScript, RegexOptions.IgnoreCase);
                string strMatchScript = matchText.Groups[0].Value;
                // Remove every tag from the body block, leaving only the text.
                string strPureText = Regex.Replace(strMatchScript, strRegex, string.Empty, RegexOptions.IgnoreCase);
                tbResult.Text = strPureText;
            }
            else
            {
                tbResult.Text = MsgPageRetrieveFailed;
            }
        }

        /// <summary>
        /// Retrieve the script code from the HTML code.
        /// </summary>
        protected void btnRetrieveSriptCode_Click(object sender, EventArgs e)
        {
            strWholeHtml = this.GetWholeHtmlCode(strUrl);
            if (flgPageRetrieved)
            {
                string strRegexScript = @"(?m)<script[^>]*>(\w|\W)*?</script[^>]*>";
                string strRegex = @"<[^>]*>";
                MatchCollection matchList = Regex.Matches(strWholeHtml, strRegexScript, RegexOptions.IgnoreCase);
                StringBuilder strbScriptList = new StringBuilder();
                foreach (Match matchSingleScript in matchList)
                {
                    // Strip the <script> tags and keep the code between them.
                    string strSingleScriptText = Regex.Replace(matchSingleScript.Value, strRegex, string.Empty, RegexOptions.IgnoreCase);
                    strbScriptList.Append(strSingleScriptText + "\r\n");
                }
                tbResult.Text = strbScriptList.ToString();
            }
            else
            {
                tbResult.Text = MsgPageRetrieveFailed;
            }
        }

        /// <summary>
        /// Retrieve the image tags from the HTML code.
        /// </summary>
        protected void btnRetrieveImage_Click(object sender, EventArgs e)
        {
            strWholeHtml = this.GetWholeHtmlCode(strUrl);
            if (flgPageRetrieved)
            {
                string strRegexImg = @"(?is)<img.*?>";
                MatchCollection matchList = Regex.Matches(strWholeHtml, strRegexImg, RegexOptions.IgnoreCase);
                StringBuilder strbImageList = new StringBuilder();
                foreach (Match matchSingleImage in matchList)
                {
                    strbImageList.Append(matchSingleImage.Value + "\r\n");
                }
                tbResult.Text = strbImageList.ToString();
            }
            else
            {
                tbResult.Text = MsgPageRetrieveFailed;
            }
        }

        /// <summary>
        /// Retrieve the opening link tags from the HTML code.
        /// </summary>
        protected void btnRetrievelink_Click(object sender, EventArgs e)
        {
            strWholeHtml = this.GetWholeHtmlCode(strUrl);
            if (flgPageRetrieved)
            {
                string strRegexLink = @"(?is)<a .*?>";
                MatchCollection matchList = Regex.Matches(strWholeHtml, strRegexLink, RegexOptions.IgnoreCase);
                StringBuilder strbLinkList = new StringBuilder();
                foreach (Match matchSingleLink in matchList)
                {
                    strbLinkList.Append(matchSingleLink.Value + "\r\n");
                }
                tbResult.Text = strbLinkList.ToString();
            }
            else
            {
                tbResult.Text = MsgPageRetrieveFailed;
            }
        }
    }
}
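One limitation of btnRetrievePureText_Click is that it only strips tags, so any script or style code inside the body, and HTML entities such as &amp;, would stay in the result. A possible refinement is sketched below as a standalone console program. The class HtmlTextSketch and the method ExtractBodyText are hypothetical names for this illustration, not part of the sample, and the patterns are a rough sketch rather than a full HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Hypothetical helper: returns the visible text of the <body> element.
    public static string ExtractBodyText(string html)
    {
        // Take the content between <body> and </body>; fall back to the whole
        // input if no body element is found.
        Match body = Regex.Match(html, @"<body[^>]*>(?<inner>.*?)</body\s*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        string text = body.Success ? body.Groups["inner"].Value : html;

        // Drop <script> and <style> blocks together with their contents.
        text = Regex.Replace(text, @"<(script|style)\b[^>]*>.*?</\1\s*>", string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace the remaining tags with a space, decode entities such as
        // &amp;, then collapse runs of whitespace.
        text = Regex.Replace(text, @"<[^>]*>", " ");
        text = WebUtility.HtmlDecode(text);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<html><body><p>Hello &amp; welcome</p>" +
                      "<script>var x = 1;</script><a href=\"#\">link</a></body></html>";
        Console.WriteLine(ExtractBodyText(html)); // prints: Hello & welcome link
    }
}
```

RegexOptions.Singleline lets the dot match line breaks, which does the same job as the (\w|\W) alternation in the sample with less backtracking.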

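Two key points in this sample:

First, it shows how to use WebRequest.Create() and WebResponse.GetResponseStream() to get the content of a web page, and StreamReader.ReadToEnd() to return it as an HTML string.

Second, it uses the basic methods Regex.Match() and Regex.Replace() (and Regex.Matches() where several results are expected) to get the specified content. How to write regular expressions is not covered in detail here; plenty of material on that is available online.

This is only a simple example of retrieving and parsing HTML code. Additions and corrections are welcome.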
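As a small illustration of how such patterns can be built, the sketch below uses named groups to pull out attribute values (href, src) instead of whole tags. The class AttributeSketch and the method GetAttributeValues are hypothetical names for this example, and the pattern assumes quoted attribute values, so it is a sketch for simple markup rather than a general HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeSketch
{
    // Hypothetical helper: returns the value of one attribute for every
    // occurrence of the given tag. Assumes the value is wrapped in quotes.
    public static List<string> GetAttributeValues(string html, string tag, string attribute)
    {
        // <tag ... attribute = "value"   or   attribute = 'value'
        // (?<q>...) remembers the opening quote and \k<q> requires the same
        // quote character to close the value.
        string pattern =
            "<" + Regex.Escape(tag) + @"\b[^>]*?\b" + Regex.Escape(attribute) +
            @"\s*=\s*(?<q>[""'])(?<value>.*?)\k<q>";

        var values = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            values.Add(m.Groups["value"].Value);
        }
        return values;
    }

    public static void Main()
    {
        // Markup taken from SourcePage.aspx.
        string html =
            "<a href=\"http://www.microsoft.com\" type=\"text/html\">www.microsoft.com</a><br />" +
            "<a href=\"http://www.asp.net\">www.asp.net</a><br />" +
            "<img alt=\"Image/asp.jpg\" src=\"Image/asp.jpg\" width=\"100\"/>";

        foreach (string link in GetAttributeValues(html, "a", "href"))
            Console.WriteLine("link:  " + link);
        foreach (string image in GetAttributeValues(html, "img", "src"))
            Console.WriteLine("image: " + image);
    }
}
```

For the markup above this lists the two link addresses and the image path. For pages with unquoted attributes or malformed markup, a dedicated HTML parser is usually a more reliable choice than regular expressions.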
