http://blog.csdn.net/shijun_zhang/article/details/6577426
1. Introduction
In network programming we often need to convert a domain name into an IP address, and that requires DNS resolution. Resolution is a top-down query process: the local resolver asks a caching name server, which recursively works down from the root servers until it finds an authoritative answer.
2. The performance bottleneck of gethostbyname
On Unix/Linux, gethostbyname is the usual way to ask DNS for a domain's IP address. Because DNS resolution is recursive, gethostbyname frequently hits severe delays while resolving a name. Unlike connect or read, it offers no way to set a timeout via setsockopt or select, so it often becomes a program's bottleneck. One proposed workaround is to arm a timer signal with alarm and, on timeout, use setjmp/longjmp to jump out of gethostbyname (I have not tried this myself, so I cannot vouch for how well it works).
Under multiple threads, gethostbyname has an even worse problem: if one thread blocks inside gethostbyname, the other threads block there as well. I ran into this while writing a crawler, and it puzzled me for a long time: every crawler thread was stuck in gethostbyname, making the crawler extremely slow. I googled the problem at length without finding an answer. Today I happened upon an e-book in our lab's Google Group, "Mining the Web - Discovering Knowledge from Hypertext Data", which includes the following passage in its discussion of crawlers:
Many clients for DNS resolution are coded poorly. Most UNIX systems provide an implementation of gethostbyname (the DNS client API—application program interface), which cannot concurrently handle multiple outstanding requests. Therefore, the crawler cannot issue many resolution requests together and poll at a later time for completion of individual requests, which is critical for acceptable performance. Furthermore, if the system-provided client is used, there is no way to distribute load among a number of DNS servers. For all these reasons, many crawlers choose to include their own custom client for DNS name resolution. The Mercator crawler from Compaq System Research Center reduced the time spent in DNS from as high as 87% to a modest 25% by implementing a custom client. The ADNS asynchronous DNS client library is ideal for use in crawlers.
In spite of these optimizations, a large-scale crawler will spend a substantial fraction of its network time not waiting for HTTP data transfer, but for address resolution. For every hostname that has not been resolved before (which happens frequently with crawlers), the local DNS may have to go across many network hops to fill its cache for the first time. To overlap this unavoidable delay with useful work, prefetching can be used. When a page that has just been fetched is parsed, a stream of HREFs is extracted. Right at this time, that is, even before any of the corresponding URLs are fetched, hostnames are extracted from the HREF targets, and DNS resolution requests are made to the caching server. The prefetching client is usually implemented using UDP instead of TCP, and it does not wait for resolution to be completed. The request serves only to fill the DNS cache so that resolution will be fast when the page is actually needed later on.
The gist is that Unix's gethostbyname cannot be used effectively in concurrent programs; this is an inherent limitation that cannot be changed. Large-scale crawlers therefore tend not to use gethostbyname, and instead implement their own custom DNS client. That makes DNS load balancing possible, and asynchronous resolution greatly speeds up lookups. Such a client is usually implemented over UDP, and it can resolve the IPs of URL hostnames ahead of time, before the crawler fetches the pages. The passage also mentions an open-source asynchronous DNS library, adns; its homepage is http://www.chiark.greenend.org.uk/~ian/adns/
From all of the above, gethostbyname is unsuitable for multithreaded environments and for any program that needs fast DNS resolution.
3. Method 1: the Linux GNU gethostbyname_r
This method is safe to use from multiple threads; in a single-machine test it reached about 100 lookups/s.
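For reference, the parameters described below belong to glibc's gethostbyname_r, a GNU extension declared in <netdb.h> (define _GNU_SOURCE before including it to make the declaration visible):

```c
#define _GNU_SOURCE          /* gethostbyname_r is a GNU extension */
#include <netdb.h>

/* glibc prototype of the reentrant lookup function: */
int gethostbyname_r(const char *name,
                    struct hostent *ret, char *buf, size_t buflen,
                    struct hostent **result, int *h_errnop);
```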
Parameters:
name — the host name to resolve; for example, Baidu's host name is www.baidu.com
ret — on success, holds the result.
buf — a scratch buffer for intermediate data; 8192 bytes is usually enough, e.g. char buf[8192]
buflen — the size of buf
result — on success, this hostent pointer points to ret, i.e. the answer; on failure, *result is NULL
h_errnop — receives the error code
The function returns 0 on success and a nonzero value on failure.
struct hostent {
    char  *h_name;        // official name of host
    char **h_aliases;     // alias list
    int    h_addrtype;    // host address type (AF_INET or AF_INET6)
    int    h_length;      // length of each address, in bytes
    char **h_addr_list;   // list of addresses (network byte order)
};
#define h_addr h_addr_list[0]  // for backward compatibility
4. Method 2: write your own DNS client
The key field to watch is TC (truncation). When TC is 1, the full answer exceeds 512 bytes and only the first 512 bytes are returned; the client must then resend the original query over TCP. Classic UDP applications are limited to payloads of 512 bytes or less, so a DNS message carried over UDP can hold at most 512 bytes of data, whereas TCP splits the data stream into segments and can therefore carry responses longer than 512 bytes, or of arbitrary length, across multiple segments.
Most books only say that DNS uses UDP port 53. That is incomplete and misleads readers into believing DNS uses only UDP and never TCP.
II. From the application's perspective