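ChilkatDotNet is a very powerful .NET component that we can use for web searching tasks; interested readers may want to explore it. Below, I will use this component to build a tool that collects email addresses from web pages.
After installing ChilkatDotNet, you will find a DLL file in the installation directory. Add a reference to that DLL in your project and you can start building your program!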
Getting Started
This is a very simple "getting started" example for spidering a web site. As you'll see in future examples, the Chilkat Spider library can be used to crawl the Web. For now, we'll concentrate on spidering a single site.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// The spider object crawls a single web site at a time. As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");
// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");
// Begin crawling the site by calling CrawlNext repeatedly.
int i;
for (i = 0; i <= 9; i++) {
    bool success;
    success = spider.CrawlNext();
    if (success == true) {
        // Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";
        // The HTML is available in the LastHtml property
    }
    else {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }
    // Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}
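Since the stated goal is a tool that collects email addresses, and each crawled page's HTML is available in the LastHtml property, the extraction step might look roughly like the sketch below. The class name EmailExtractor and the regular expression are assumptions made for illustration and are not part of Chilkat; the pattern is deliberately simple and will not match every valid address.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Chilkat): pulls email-like strings out of HTML text.
public static class EmailExtractor
{
    // Deliberately simple pattern; it does not cover every address the RFCs allow.
    private static readonly Regex EmailPattern = new Regex(
        @"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+",
        RegexOptions.Compiled);

    // Returns the distinct addresses found in html, in order of first appearance.
    // Addresses that differ only by letter case are treated as duplicates.
    public static List<string> Extract(string html)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        if (string.IsNullOrEmpty(html))
        {
            return result;
        }
        foreach (Match m in EmailPattern.Matches(html))
        {
            if (seen.Add(m.Value))
            {
                result.Add(m.Value);
            }
        }
        return result;
    }
}
```

Inside the crawl loop, one could call EmailExtractor.Extract(spider.LastHtml) after a successful CrawlNext and append each result to textBox1.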
Extract HTML Title, Description, Keywords
This example expands on the "getting started" example by showing how to access the HTML title, description, and keywords within each page spidered. These are the contents of the TITLE tag and of the META tags for description and keywords found in the HTML header.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// The spider object crawls a single web site at a time. As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");
// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");
// Begin crawling the site by calling CrawlNext repeatedly.
int i;
for (i = 0; i <= 9; i++) {
    bool success;
    success = spider.CrawlNext();
    if (success == true) {
        // Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";
        textBox1.Refresh();
        // The HTML META keywords, title, and description are available in these properties:
        textBox1.Text += spider.LastHtmlTitle + "\r\n";
        textBox1.Refresh();
        textBox1.Text += spider.LastHtmlDescription + "\r\n";
        textBox1.Refresh();
        textBox1.Text += spider.LastHtmlKeywords + "\r\n";
        textBox1.Refresh();
        // The HTML is available in the LastHtml property
    }
    else {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }
    // Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}
Fetch robots.txt for a Site
The Chilkat Spider library is robots.txt compliant. It automatically fetches a site's robots.txt file and adheres to it. It will not download pages denied by robots.txt. Pages excluded by robots.txt will not appear in the Spider's "unspidered" list. This example shows how to explicitly download and review the robots.txt for a given site.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
spider.Initialize("www.chilkatsoft.com");
string robotsText;
robotsText = spider.FetchRobotsText();
textBox1.Text += robotsText + "\r\n";
textBox1.Refresh();
Avoid URLs Matching Any of a Set of Patterns
Demonstrates how to use "avoid patterns" to prevent spidering any URL that matches a wildcarded pattern. This example avoids URLs containing the substrings "java", "python", or "perl".
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// The spider object crawls a single web site at a time. As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");
// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");
// Avoid URLs matching these patterns:
spider.AddAvoidPattern("*java*");
spider.AddAvoidPattern("*python*");
spider.AddAvoidPattern("*perl*");
// Begin crawling the site by calling CrawlNext repeatedly.
int i;
for (i = 0; i <= 9; i++) {
    bool success;
    success = spider.CrawlNext();
    if (success == true) {
        // Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";
        // The HTML is available in the LastHtml property
    }
    else {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }
    // Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}
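To get a feel for which URLs a wildcarded pattern such as "*java*" would exclude, the sketch below translates a pattern into a regular expression and tests a URL against it. This is not Chilkat's implementation: the WildcardMatcher class is hypothetical, and it assumes that '*' matches any run of characters and that matching ignores case, which the spider may or may not do.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not Chilkat's implementation): tests a URL against a
// pattern in which '*' stands for any run of characters, including none.
public static class WildcardMatcher
{
    public static bool IsMatch(string url, string pattern)
    {
        // Escape regex metacharacters, then turn each escaped '*' back into ".*".
        string regex = "^" + Regex.Escape(pattern).Replace("\\*", ".*") + "$";
        // Assumption: matching ignores case.
        return Regex.IsMatch(url, regex, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

Calling WildcardMatcher.IsMatch(url, "*perl*") on a list of candidate URLs before adding the pattern is a quick way to check that a pattern is not broader than intended.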
Setting a Maximum Response Size
The MaxResponseSize property protects your spider from downloading a page that is too large. By default, MaxResponseSize = 300,000 bytes. Setting it to 0 indicates that there is no maximum. You may set it to a number indicating the maximum number of bytes to download. URLs with response sizes larger than this will be skipped.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
spider.Initialize("www.chilkatsoft.com");
// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");
// This example demonstrates setting the MaxResponseSize property
// Do not download anything with a response size greater than 100,000 bytes.
spider.MaxResponseSize = 100000;
Setting a Maximum URL Length
The MaxUrlLen property prevents the spider from retrieving URLs that grow too long. The default value of MaxUrlLen is 300.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
spider.Initialize("www.chilkatsoft.com");
// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");
// This example demonstrates setting the MaxUrlLen property
// Do not add URLs longer than 250 characters to the "unspidered" queue:
spider.MaxUrlLen = 250;
// ...
Using the Disk Cache
The Chilkat Spider component has disk caching capabilities. To set up a disk cache, create a new directory anywhere on your local hard drive and set the CacheDir property to the path. For example, you might create "c:/spiderCache/". The UpdateCache property controls whether downloaded pages are saved to the cache. The FetchFromCache property controls whether the cache is checked first for pages. The LastFromCache property tells whether the last URL fetched came from cache or not.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// Set our cache directory and make sure saving-to-cache and fetching-from-cache
// are both turned on:
spider.CacheDir = "c:/spiderCache/";
spider.FetchFromCache = true;
spider.UpdateCache = true;
// If you run this code twice, you'll find that the 2nd run is extremely fast
// because the pages will be retrieved from cache.
// The spider object crawls a single web site at a time. As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");
// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");
// Begin crawling the site by calling CrawlNext repeatedly.
int i;
for (i = 0; i <= 9; i++) {
    bool success;
    success = spider.CrawlNext();
    if (success == true) {
        // Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";
        // The HTML is available in the LastHtml property
    }
    else {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }
    // Sleep 1 second before spidering the next URL.
    // The reason for waiting a short time before the next fetch is to prevent
    // undue stress on the web server. However, if the last page was retrieved
    // from cache, there is no need to pause.
    if (spider.LastFromCache != true) {
        spider.SleepMs(1000);
    }
}
Crawling the Web
If the Chilkat Spider component only crawls a single site, how do you crawl the Web? The answer is simple: as you crawl a site, the spider collects outbound links and makes them accessible to you. You may then create an instance of the Spider object for each site, and crawl it. The task of keeping track of what sites you've already crawled is left to you (for now). This example retrieves the home page of http://www.joelonsoftware.com/ and displays the outbound links.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// The Initialize method may be called with just the domain name,
// such as "www.joelonsoftware.com" or a full URL. If you pass only
// the domain name, you must add URLs to the unspidered list by calling
// AddUnspidered. Otherwise, the URL you pass to Initialize is the 1st
// URL in the unspidered list.
spider.Initialize("www.joelonsoftware.com");
spider.AddUnspidered("http://www.joelonsoftware.com/");
bool success;
success = spider.CrawlNext();
int i;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
    textBox1.Refresh();
}
Get Referenced Domains
Demonstrates how to accumulate a list of unique domain names referenced from outbound URLs.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
Chilkat.StringArray domainList = new Chilkat.StringArray();
// Set the Unique property so that duplicates are not added.
domainList.Unique = true;
// Crawl the home page of joelonsoftware.com and get the outbound URLs
spider.Initialize("www.joelonsoftware.com");
spider.AddUnspidered("http://www.joelonsoftware.com/");
bool success;
success = spider.CrawlNext();
// Build a list of unique domains.
int i;
string url;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    url = spider.GetOutboundLink(i);
    domainList.Append(spider.GetDomain(url));
}
// Display the domains.
for (i = 0; i <= domainList.Count - 1; i++) {
    textBox1.Text += domainList.GetString(i) + "\r\n";
    textBox1.Refresh();
}
Get Base Domains
Demonstrates how to accumulate a list of unique domain names referenced from outbound URLs and display each one together with its base domain.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
Chilkat.StringArray domainList = new Chilkat.StringArray();
// Set the Unique property so that duplicates are not added.
domainList.Unique = true;
// Crawl the home page of joelonsoftware.com and get the outbound URLs
spider.Initialize("www.joelonsoftware.com");
spider.AddUnspidered("http://www.joelonsoftware.com/");
bool success;
success = spider.CrawlNext();
// Build a list of unique domains.
int i;
string url;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    url = spider.GetOutboundLink(i);
    domainList.Append(spider.GetDomain(url));
}
// Display the domains.
for (i = 0; i <= domainList.Count - 1; i++) {
    textBox1.Text += domainList.GetString(i) + "\r\n";
    textBox1.Refresh();
    textBox1.Text += spider.GetBaseDomain(domainList.GetString(i))
        + "\r\n" + "\r\n";
    textBox1.Refresh();
}
GetBaseDomain
The GetBaseDomain method is a utility function that converts a domain into a "domain base", which is useful for grouping URLs. For example: abc.chilkatsoft.com, xyz.chilkatsoft.com, and blog.chilkatsoft.com all have the same base domain: chilkatsoft.com. Things get more complicated when considering country domains (.au, .uk, .se, .cn, etc.) and government, state, and .us domains. Also, domains such as blogspot, tripod, geocities, wordpress, etc., are treated specially so that "xyz.blogspot.com" has a base domain of "xyz.blogspot.com". Note: If you find other domains that should be treated similarly to blogspot.com, send a request to Chilkat support.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
textBox1.Text += spider.GetBaseDomain("www.chilkatsoft.com") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("blog.chilkatsoft.com") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("www.news.com.au") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("blogs.bbc.co.uk") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("xyz.blogspot.com") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("www.heaids.org.za") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("www.hec.gov.pk") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("www.e-mrs.org") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("cra.curtin.edu.au") + "\r\n";
textBox1.Refresh();
// Prints:
// chilkatsoft.com
// chilkatsoft.com
// news.com.au
// bbc.co.uk
// xyz.blogspot.com
// heaids.org.za
// hec.gov.pk
// e-mrs.org
// curtin.edu.au
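The grouping rule described above can be illustrated with a much-simplified sketch. This is not Chilkat's algorithm: the BaseDomainSketch class and its two suffix lists are assumptions made for illustration, and the lists are tiny samples that cover only the domains shown in this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, much-simplified illustration of base-domain grouping.
// Not Chilkat's algorithm: the suffix lists below are tiny samples.
public static class BaseDomainSketch
{
    // Two-label suffixes under country-code domains (sample only).
    private static readonly HashSet<string> TwoLabelSuffixes = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase)
    {
        "com.au", "edu.au", "co.uk", "org.za", "gov.pk"
    };

    // Hosting domains whose subdomains count as separate sites (sample only).
    private static readonly HashSet<string> HostingDomains = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase)
    {
        "blogspot.com", "wordpress.com"
    };

    public static string GetBaseDomain(string domain)
    {
        string[] labels = domain.ToLowerInvariant().Split('.');
        if (labels.Length <= 2)
        {
            return string.Join(".", labels);
        }
        string lastTwo = labels[labels.Length - 2] + "." + labels[labels.Length - 1];
        // Keep three labels when the last two form a known suffix or hosting domain.
        int keep = (TwoLabelSuffixes.Contains(lastTwo) || HostingDomains.Contains(lastTwo)) ? 3 : 2;
        return string.Join(".", labels, labels.Length - keep, keep);
    }
}
```

A production version would need a complete public-suffix list rather than hand-picked entries, which is presumably why the library ships this as a built-in utility.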
CanonicalizeUrl
The CanonicalizeUrl method is a utility function that canonicalizes a URL into a standard form to avoid duplicates. For example, "http://www.chilkatsoft.com/" and "http://www.chilkatsoft.com/default.asp" are the same URL.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// Does a DNS lookup to find the default domain, which may or may not include the "www." depending on the DNS results.
// Also domain names are converted to lowercase:
textBox1.Text += spider.CanonicalizeUrl("http://www.ChilkatSoft.com/") + "\r\n";
textBox1.Refresh();
// CanonicalizeUrl will drop the HTML fragment:
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/purchase2.asp#buyZip") + "\r\n";
textBox1.Refresh();
// If a username/password is in the URL, it gets dropped:
textBox1.Text += spider.CanonicalizeUrl("http://username:password@www.chilkatsoft.com/purchase2.asp#buyZip") + "\r\n";
textBox1.Refresh();
// Port 80 and 443 are dropped:
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com:80/purchase2.asp") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("https://www.paypal.com:443/") + "\r\n";
textBox1.Refresh();
// Removes default pages:
// default.asp, index.html, index.htm, default.html, default.htm
// index.php, index.asp, default.php, .cfm, .aspx, .php3, .pl, .cgi, .txt, .shtml, .phtml
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.asp") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.asp") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.php") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.pl") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.htm") + "\r\n";
textBox1.Refresh();
// Output:
// http://chilkatsoft.com/
// http://chilkatsoft.com/purchase2.asp
// http://chilkatsoft.com/purchase2.asp
// http://chilkatsoft.com/purchase2.asp
// https://www.paypal.com/
// http://chilkatsoft.com/
// http://chilkatsoft.com/
// http://chilkatsoft.com/
// http://chilkatsoft.com/
// http://chilkatsoft.com/
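For readers curious about what these canonicalization steps involve, here is a rough sketch built on System.Uri. It is not Chilkat's implementation: the CanonicalizeSketch class is hypothetical, it performs no DNS lookup (so a leading "www." is kept, unlike the output above), and it recognizes only a handful of default page names.

```csharp
using System;

// Hypothetical sketch of URL canonicalization with System.Uri.
// Not Chilkat's implementation: no DNS lookup is done, so "www." is kept,
// and only a few default page names are recognized.
public static class CanonicalizeSketch
{
    private static readonly string[] DefaultPages =
    {
        "default.asp", "default.htm", "default.html",
        "index.asp", "index.htm", "index.html", "index.php"
    };

    public static string Canonicalize(string url)
    {
        var uri = new Uri(url); // Uri lowercases the scheme and host
        string path = uri.AbsolutePath;
        foreach (string page in DefaultPages)
        {
            if (path.EndsWith("/" + page, StringComparison.OrdinalIgnoreCase))
            {
                // Drop the page name but keep the trailing slash.
                path = path.Substring(0, path.Length - page.Length);
                break;
            }
        }
        // Rebuild without user info or fragment; omit the port when it is the scheme default.
        string authority = uri.IsDefaultPort ? uri.Host : uri.Host + ":" + uri.Port;
        return uri.Scheme + "://" + authority + path + uri.Query;
    }
}
```

Rebuilding the URL from its parts, rather than editing the string in place, makes it easy to leave out the user info and fragment simply by not copying them.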
Avoiding Outbound Links Matching Patterns
The spider accumulates outbound links when crawling. Your program may specify any number of "avoid patterns" to prevent any link matching at least one of the wildcarded patterns from being added.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// First, we'll get the outbound links for a page in the
// Google directory. Then we'll add some avoid patterns
// and then re-fetch, to see it work...
spider.Initialize("directory.google.com");
spider.AddUnspidered("http://directory.google.com/Top/Recreation/Food/Cheese/");
bool success;
success = spider.CrawlNext();
// Display the outbound links
int i;
string url;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
}
// The output:
// http://www.cheese.com/
// http://www.cheesediaries.com/
// http://www.WisDairy.com/
// http://www.newenglandcheese.com
// http://www.ilovecheese.com
// http://www.cheesefromspain.com
// http://www.realcaliforniacheese.com/
// http://www.frencheese.co.uk/
// http://www.cheesesociety.org/
// http://www.specialcheese.com/queso.htm
// http://www.franceway.com/cheese/intro.htm
// http://www.foodsubs.com/Chesfirm.html
// http://www.cheeseboard.co.uk/
// http://www.thecheeseweb.com/
// http://www.vtcheese.com/
// http://www.coldbacon.com/cheese.html
// http://www.norwegiancheeses.co.uk/
// http://www.reluctantgourmet.com/cheese.htm
// http://www.lancewood.co.za/
// http://www.switzerlandcheese.ca
// http://www.frenchcheese.dk/
// http://www.dolcevita.com/cuisine/cheese/cheese.htm
// http://cheeseisland.net/
// http://www.cheestrings.ca/
// http://www.dreamcheese.co.uk
// http://hgic.clemson.edu/factsheets/HGIC3506.htm
// http://www.epicurious.com/cooking/how_to/food_dictionary/entry?id=1815
// http://www.mousetrapcheese.co.uk
// http://taquitos.net/yum/gc.shtml
// http://www.greek-recipe.com/static/greek-cheese
// http://www.park.org/Netherlands/pavilions/food_and_markets/cheese/introduction.html
// http://www.dairyfarmers.org/engl/recipes/4_1.asp
// http://www.prairieridgecheese.com/wischeesguid.html
// http://dmoz.org/cgi-bin/add.cgi?where=Recreation/Food/Cheese
// http://dmoz.org/about.html
// http://dmoz.org/cgi-bin/apply.cgi?where=Recreation/Food/Cheese
// Do it again, but this time with avoid patterns.
spider.Initialize("directory.google.com");
spider.AddUnspidered("http://directory.google.com/Top/Recreation/Food/Cheese/");
// Add some avoid patterns:
spider.AddAvoidOutboundLinkPattern("*dmoz.org*");
spider.AddAvoidOutboundLinkPattern("*?id=*");
spider.AddAvoidOutboundLinkPattern("*.co.uk*");
success = spider.CrawlNext();
textBox1.Text += "-----------------------" + "\r\n";
// Display the outbound links
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
}
// Output:
// http://www.cheese.com/
// http://www.cheesediaries.com/
// http://www.WisDairy.com/
// http://www.newenglandcheese.com
// http://www.ilovecheese.com
// http://www.cheesefromspain.com
// http://www.realcaliforniacheese.com/
// http://www.cheesesociety.org/
// http://www.specialcheese.com/queso.htm
// http://www.franceway.com/cheese/intro.htm
// http://www.foodsubs.com/Chesfirm.html
// http://www.thecheeseweb.com/
// http://www.vtcheese.com/
// http://www.coldbacon.com/cheese.html
// http://www.reluctantgourmet.com/cheese.htm
// http://www.lancewood.co.za/
// http://www.switzerlandcheese.ca
// http://www.frenchcheese.dk/
// http://www.dolcevita.com/cuisine/cheese/cheese.htm
// http://cheeseisland.net/
// http://www.cheestrings.ca/
// http://hgic.clemson.edu/factsheets/HGIC3506.htm
// http://taquitos.net/yum/gc.shtml
// http://www.greek-recipe.com/static/greek-cheese
// http://www.park.org/Netherlands/pavilions/food_and_markets/cheese/introduction.html
// http://www.dairyfarmers.org/engl/recipes/4_1.asp
// http://www.prairieridgecheese.com/wischeesguid.html
Must-Match Patterns
You may restrict the spider to only follow links that match any one of a set of "must-match" wildcard patterns. The AddMustMatchPattern method can be called repeatedly to add must-match patterns.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// First, we'll get the outbound links for a page in the
// Google directory. Then we'll add some must-match patterns
// and then re-fetch, to see it work...
spider.Initialize("directory.google.com");
spider.AddUnspidered("http://directory.google.com/Top/Recreation/Outdoors/Hiking/Backpacking/");
bool success;
success = spider.CrawlNext();
// Display the outbound links
int i;
string url;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
    textBox1.Refresh();
}
// The output:
// http://www.backpacker.com
// http://www.cmc.org
// http://www.backpacking.net
// http://www.thebackpacker.com/
// http://www.rei.com/online/store/LearnShareArticlesList?categoryId=Camping
// http://www.trailspace.com/
// http://www.catskillhikes.com/
// http://gorp.away.com/gorp/location/asia/nepal/favpicks.htm
// http://www.backpackinglight.com/cgi-bin/backpackinglight/index.html
// http://www.yetizone.com/
// http://www.backpackingfun.com
// http://www.freezerbagcooking.com/
// http://www.spadout.com/backpacking/
// http://sierrabackpacker.com
// http://www.abovecalifornia.com/
// http://www.personal.psu.edu/faculty/r/p/rpc1/bbb/
// http://www.thebackpackersguide.com
// http://www.journeywest.com/WB/index.html
// http://www.johann-sandra.com/backpackdir.htm
// http://www.geocities.com/amytys/
// http://www.cloudwalkersbasecamp.com
// http://www.netbackpacking.com
// http://members.tripod.com/~stooges/
// http://www.thebackpackingsite.com
// http://www.thruhikers.com/
// http://www.redcompservices.com/AT/
// http://members.aol.com/CMorHiker/backpack
// http://mywebpages.comcast.net/midwestpacker/
// http://www.midwesthiker.com/
// http://www.WeBackpack.com
// http://www.michiganhiker.com
// http://www.host33.com/backpack/
// http://www.wilderness-backpacking.com
// http://www.thetravelmonkey.net
// http://dmoz.org/cgi-bin/add.cgi?where=Recreation/Outdoors/Hiking/Backpacking
// http://dmoz.org/about.html
// http://dmoz.org/cgi-bin/apply.cgi?where=Recreation/Outdoors/Hiking/Backpacking
// http://dmoz.org
// http://dmoz.org/profiles/cdog.html
// http://dmoz.org/profiles/justinwp.html
// Do it again, but this time with must-match and avoid patterns.
spider.Initialize("directory.google.com");
spider.AddUnspidered("http://directory.google.com/Top/Recreation/Outdoors/Hiking/Backpacking/");
// Add some must-match patterns:
spider.AddMustMatchPattern("*.com/*");
spider.AddMustMatchPattern("*.net/*");
// Add some avoid-patterns:
spider.AddAvoidOutboundLinkPattern("*.mypages.*");
spider.AddAvoidOutboundLinkPattern("*.personal.*");
spider.AddAvoidOutboundLinkPattern("*.comcast.*");
spider.AddAvoidOutboundLinkPattern("*.aol.*");
spider.AddAvoidOutboundLinkPattern("*~*");
success = spider.CrawlNext();
textBox1.Text += "-----------------------" + "\r\n";
textBox1.Refresh();
// Display the outbound links
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
    textBox1.Refresh();
}
// Output:
// http://www.thebackpacker.com/
// http://www.rei.com/online/store/LearnShareArticlesList?categoryId=Camping
// http://www.trailspace.com/
// http://www.catskillhikes.com/
// http://gorp.away.com/gorp/location/asia/nepal/favpicks.htm
// http://www.backpackinglight.com/cgi-bin/backpackinglight/index.html
// http://www.yetizone.com/
// http://www.freezerbagcooking.com/
// http://www.spadout.com/backpacking/
// http://www.abovecalifornia.com/
// http://www.journeywest.com/WB/index.html
// http://www.johann-sandra.com/backpackdir.htm
// http://www.geocities.com/amytys/
// http://www.thruhikers.com/
// http://www.redcompservices.com/AT/
// http://www.midwesthiker.com/
// http://www.host33.com/backpack/
A Simple Web Crawler
This demonstrates a very simple web crawler using the Chilkat Spider component.
// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
Chilkat.StringArray seenDomains = new Chilkat.StringArray();
Chilkat.StringArray seedUrls = new Chilkat.StringArray();
seenDomains.Unique = true;
seedUrls.Unique = true;
seedUrls.Append("http://directory.google.com/Top/Recreation/Outdoors/Hiking/Backpacking/");
// Set our outbound URL exclude patterns
spider.AddAvoidOutboundLinkPattern("*?id=*");
spider.AddAvoidOutboundLinkPattern("*.mypages.*");
spider.AddAvoidOutboundLinkPattern("*.personal.*");
spider.AddAvoidOutboundLinkPattern("*.comcast.*");
spider.AddAvoidOutboundLinkPattern("*.aol.*");
spider.AddAvoidOutboundLinkPattern("*~*");
// Use a cache so we don't have to re-fetch URLs previously fetched.
spider.CacheDir = "c:/spiderCache/";
spider.FetchFromCache = true;
spider.UpdateCache = true;
while (seedUrls.Count > 0) {
    string url;
    url = seedUrls.Pop();
    spider.Initialize(url);
    // Spider 5 URLs of this domain.
    // but first, save the base domain in seenDomains
    string domain;
    domain = spider.GetDomain(url);
    seenDomains.Append(spider.GetBaseDomain(domain));
    int i;
    bool success;
    for (i = 0; i <= 4; i++) {
        success = spider.CrawlNext();
        if (success != true) {
            break;
        }
        // Display the URL we just crawled.
        textBox1.Text += spider.LastUrl + "\r\n";
        // If the last URL was retrieved from cache,
        // we won't wait. Otherwise we'll wait 1 second
        // before fetching the next URL.
        if (spider.LastFromCache != true) {
            spider.SleepMs(1000);
        }
    }
    // Add the outbound links to seedUrls, except
    // for the domains we've already seen.
    for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
        url = spider.GetOutboundLink(i);
        domain = spider.GetDomain(url);
        string baseDomain;
        baseDomain = spider.GetBaseDomain(domain);
        if (!seenDomains.Contains(baseDomain)) {
            seedUrls.Append(url);
        }
        // Don't let our list of seedUrls grow too large.
        if (seedUrls.Count > 1000) {
            break;
        }
    }
}