Earlier I reposted an article introducing the ChilkatDotNet component; below I will use that component to build a tool that collects Email addresses from web pages.
Collecting information from web pages involves two problems. The first is writing a spider that can traverse pages by following links — ChilkatDotNet already gives us good support for this. The second is extracting the desired information from each page — there are many ways to do this, and here I chose regular expressions.
First, a screenshot of the program at runtime:
The UI design is simple: three TextBoxes, one RichTextBox, and two Buttons. The three TextBoxes take the site address, the starting URL, and the number of links to traverse; the RichTextBox holds the information collected from the pages — here I save each page's URL and the Email addresses found on it.
The program has two main parts. The first traverses the site; the code is as follows:
Chilkat.Spider spider = new Chilkat.Spider();
string website = this.textWebsite.Text;
string url = this.textUrl.Text;
int links = Int32.Parse(this.textLinks.Text);

// The Spider object crawls a single web site at a time. Initialize it
// with the site's domain, then seed it with the starting URL.
spider.Initialize(website);

// Add the 1st URL:
spider.AddUnspidered(url);

// Begin crawling the site by calling CrawlNext repeatedly,
// up to the number of links entered in the textbox.
for (int i = 0; i < links; i++)
{
    bool success = spider.CrawlNext();
    if (success)
    {
        // Show the URL just crawled, then scan that page for Email addresses.
        Invoke(new AppendTextDelegate(AppendText), new object[] { spider.LastUrl + "\r\n" });
        GetAllURL(spider.LastUrl);
    }
    else
    {
        // Did we get an error, or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0)
        {
            MessageBox.Show("No more URLs to spider");
        }
        else
        {
            MessageBox.Show(spider.LastErrorText);
        }
        break;
    }

    // Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}
This is much like the sample code that ships with ChilkatDotNet; the only addition is reading the initial settings from the textboxes. Once a URL has been obtained, the program downloads the page content and then extracts Email addresses with a regular expression.
Downloading the page content:
// Issue a plain GET request for the page and read the whole body as text.
HttpWebRequest webRequest1 = (HttpWebRequest)WebRequest.Create(new Uri(URlStr));
webRequest1.Method = "GET";
HttpWebResponse response = (HttpWebResponse)webRequest1.GetResponse();
Stream stream = response.GetResponseStream();
StreamReader streamReader = new StreamReader(stream, Encoding.Default);
string textData = streamReader.ReadToEnd();
streamReader.Close();
response.Close();
The Email addresses are then pulled out of textData with the following regular expression (note that it must be applied with RegexOptions.IgnoreCase, since its character classes list only upper-case letters):
@"(?<EmailStr>\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b)"
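To show how this pattern would be applied to the downloaded text, here is a minimal self-contained sketch (the sample input string is mine; in the real program textData comes from the download snippet above):

```csharp
using System;
using System.Text.RegularExpressions;

class EmailExtractor
{
    static void Main()
    {
        // Stand-in for the page text read by the HttpWebRequest code above.
        string textData = "Contact us at sales@example.com or SUPPORT@example.org.";

        // IgnoreCase is required: the pattern's character classes are upper-case only.
        Regex regex = new Regex(
            @"(?<EmailStr>\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b)",
            RegexOptions.IgnoreCase);

        // Each match's named group "EmailStr" holds one Email address.
        foreach (Match m in regex.Matches(textData))
        {
            Console.WriteLine(m.Groups["EmailStr"].Value);
        }
    }
}
```

In the real tool, each matched address would be appended to the RichTextBox under its page URL instead of written to the console.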
There are plenty of tutorials on regular expressions online; picking any one of them to learn the basics is enough.
Here I only collect Email addresses from a single site. With the ChilkatDotNet component it would not be hard to extend this to collect information across the whole web; interested readers can explore that on their own.
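As a rough sketch of how that extension might look: the Chilkat samples suggest that after each CrawlNext the spider exposes off-site links through NumOutboundLinks and GetOutboundLink (treat these member names as my reading of the Chilkat API, not verified here), so new domains could be queued and each given its own Spider pass:

```csharp
using System.Collections.Generic;

// Sketch: after crawling a page, collect its outbound (off-site) links
// and queue each one so a new Spider can be started on that domain later.
Chilkat.Spider spider = new Chilkat.Spider();
Queue<string> pendingSites = new Queue<string>();

spider.Initialize("www.chilkatsoft.com");
spider.AddUnspidered("http://www.chilkatsoft.com/");

if (spider.CrawlNext())
{
    for (int i = 0; i < spider.NumOutboundLinks; i++)
    {
        pendingSites.Enqueue(spider.GetOutboundLink(i));
    }
}
// Each queued URL would then seed a fresh Spider (Initialize + AddUnspidered)
// and be processed with the same crawl loop shown earlier.
```

A real whole-web collector would also need to deduplicate domains and respect politeness delays, but the queue-of-sites idea is the core of it.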