Writing an Email Harvesting Tool with a .NET Component

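        I previously reposted an article introducing the ChilkatDotNet component. In this post I use that component to write a tool that collects email addresses from web pages.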

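       Collecting information from web pages involves two difficulties. The first is writing a spider that traverses pages by following links; ChilkatDotNet already supports this well. The second is extracting the required information from each page; there are many ways to do this, and here I use regular expressions.
       Here is a screenshot of the program running: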
      

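       The interface is simple: three TextBox controls, one RichTextBox, and two Button controls. The three TextBox controls take the site address, the starting URL, and the number of links to traverse. The RichTextBox displays the collected information; here I record each page's URL and the email addresses found on that page.
       The program has two main parts. The first is traversing the site, with the following code: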
      

            Chilkat.Spider spider = new Chilkat.Spider();

            //  Read the starting conditions from the three text boxes.
            string website = this.textWebsite.Text;
            string url = this.textUrl.Text;
            int links = Int32.Parse(this.textLinks.Text);

            //  The spider object crawls a single web site at a time.
            spider.Initialize(website);

            //  Add the 1st URL:
            spider.AddUnspidered(url);

            //  Begin crawling the site by calling CrawlNext repeatedly,
            //  once per page, up to the requested number of pages.
            for (int i = 0; i < links; i++)
            {
                bool success = spider.CrawlNext();

                if (success)
                {
                    //  Invoke marshals the call onto the UI thread that owns the RichTextBox.
                    Invoke(new AppendTextDelegate(AppendText), new object[] { spider.LastUrl + "\r\n" });
                    GetAllURL(spider.LastUrl);
                }
                else if (spider.NumUnspidered == 0)
                {
                    //  Nothing left to crawl: stop instead of repeating this message.
                    MessageBox.Show("No more URLs to spider");
                    break;
                }
                else
                {
                    //  This URL failed; report the error and move on to the next one.
                    MessageBox.Show(spider.LastErrorText);
                }

                //  Sleep 1 second before spidering the next URL.
                spider.SleepMs(1000);
            }

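       This is close to the sample code that ships with ChilkatDotNet; the only addition is the code that reads the starting conditions from the text boxes. Once a URL has been obtained, the page content is fetched and the email addresses are extracted from it with a regular expression.
       Fetching the page content: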
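            //  Requires: using System; using System.IO; using System.Net; using System.Text;
            HttpWebRequest webRequest1 = (HttpWebRequest)WebRequest.Create(new Uri(URlStr));
            webRequest1.Method = "GET";
            String textData;
            //  The using blocks close the response, stream and reader even if reading throws.
            using (HttpWebResponse response = (HttpWebResponse)webRequest1.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (StreamReader streamReader = new StreamReader(stream, Encoding.Default))
            {
                //  Encoding.Default depends on the runtime and system settings, so pages
                //  served in a different encoding may decode incorrectly.
                textData = streamReader.ReadToEnd();
            }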
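     The regular expression for extracting email addresses (its character classes list only uppercase letters, so it needs to be used with RegexOptions.IgnoreCase to match lowercase addresses):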
@"(?<EmailStr>\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b)"
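
     As a minimal sketch of how this pattern might be applied to the downloaded page text: the class name EmailExtractor and the method ExtractEmails are hypothetical and do not come from the original program. The sketch assumes you want each address only once, in order of first appearance, compared case-insensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original tool.
public static class EmailExtractor
{
    // Same pattern as in the article. IgnoreCase is required because the
    // character classes only list uppercase letters.
    private static readonly Regex EmailRegex = new Regex(
        @"(?<EmailStr>\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b)",
        RegexOptions.IgnoreCase);

    // Returns each distinct address found in the text, in order of first
    // appearance. Duplicates are detected case-insensitively.
    public static List<string> ExtractEmails(string text)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();

        foreach (Match match in EmailRegex.Matches(text))
        {
            // Read the address through the named group defined in the pattern.
            string email = match.Groups["EmailStr"].Value;
            if (seen.Add(email))
            {
                result.Add(email);
            }
        }

        return result;
    }
}
```

     In the tool, something like EmailExtractor.ExtractEmails(textData) could be called on the page text read above, with each returned address appended to the RichTextBox. Note that the {2,4} limit in the pattern skips top-level domains longer than four letters.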

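     There are plenty of tutorials online on how to use regular expressions; any of them will do for learning the basics.
     Here I only collect email addresses from a single site. With ChilkatDotNet it is not hard to extend this to collect information across the whole web; interested readers can explore that on their own.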