Notes on DNS resolution, and the drawbacks of gethostbyname

http://blog.csdn.net/shijun_zhang/article/details/6577426

1. Introduction

 

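Network programs frequently need to convert a domain name into an IP address, and that is what domain name resolution is for. Resolution is a vertical chain of requests, passed from one level to the next; the exact flow is shown in the figure below.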

 

2. The performance bottleneck of gethostbyname

 

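On Unix/Linux, gethostbyname is the function commonly used to ask DNS for the IP address of a domain name. Because DNS queries are resolved recursively, a gethostbyname call for a single name can take a very long time to time out. Unlike connect or read, it offers no way to set a timeout through setsockopt or select, so it often becomes the bottleneck of a program. One workaround that has been suggested is to arm a timer signal with alarm and, when it fires, jump out of gethostbyname with setjmp and longjmp (I have not tried this and do not know how well it works).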
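    In a multithreaded program gethostbyname has an even more serious problem: if one thread blocks inside gethostbyname, every other thread that calls it blocks there too. I ran into this while writing a crawler and it puzzled me for a long time. All crawler threads were stuck in gethostbyname, which made the crawl very slow. I searched Google for a long time without finding an answer. Today I happened to find an e-book in my lab's Google Group, "Mining the Web - Discovering Knowledge from Hypertext Data", and its discussion of crawlers includes the following passages: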

    Many clients for DNS resolution are coded poorly. Most UNIX systems provide an implementation of gethostbyname (the DNS client API, application program interface), which cannot concurrently handle multiple outstanding requests. Therefore, the crawler cannot issue many resolution requests together and poll at a later time for completion of individual requests, which is critical for acceptable performance. Furthermore, if the system-provided client is used, there is no way to distribute load among a number of DNS servers. For all these reasons, many crawlers choose to include their own custom client for DNS name resolution. The Mercator crawler from Compaq System Research Center reduced the time spent in DNS from as high as 87% to a modest 25% by implementing a custom client. The ADNS asynchronous DNS client library is ideal for use in crawlers.
    In spite of these optimizations, a large-scale crawler will spend a substantial fraction of its network time not waiting for HTTP data transfer, but for address resolution. For every hostname that has not been resolved before (which happens frequently with crawlers), the local DNS may have to go across many network hops to fill its cache for the first time. To overlap this unavoidable delay with useful work, prefetching can be used. When a page that has just been fetched is parsed, a stream of HREFs is extracted. Right at this time, that is, even before any of the corresponding URLs are fetched, hostnames are extracted from the HREF targets, and DNS resolution requests are made to the caching server. The prefetching client is usually implemented using UDP instead of TCP, and it does not wait for resolution to be completed. The request serves only to fill the DNS cache so that resolution will be fast when the page is actually needed later on.

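    The gist is that the UNIX gethostbyname cannot be used by concurrent code, and that this is an inherent defect that cannot be changed. Large crawlers usually avoid gethostbyname and implement their own custom DNS client. That makes it possible to balance load across DNS servers, and asynchronous resolution speeds up lookups considerably. Such clients are usually built on UDP and can resolve the IP address for a URL before the crawler fetches the page. The text also mentions an open-source asynchronous DNS library, adns, whose home page is http://www.chiark.greenend.org.uk/~ian/adns/
    In short, gethostbyname is not suitable for multithreaded programs or for any other program that needs fast DNS resolution.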

 

3. Method 1: gethostbyname_r (GNU/Linux)


This method works in multithreaded programs. In a test on a single machine it reached about 100 lookups per second.

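Parameters:
    name: the host name to look up; for Baidu, for example, it is www.baidu.com
    ret: stores the result when the call succeeds
    buf: a scratch buffer that holds the intermediate data produced during the lookup; 8192 bytes is usually enough, for example an array declared as char buf[8192]
    buflen: the size of buf
    result: on success, this hostent pointer points to ret, that is, to the valid result; on failure, result is NULL
    h_errnop: receives the error code
The function returns 0 on success and a nonzero value on failure (ERANGE when buf is too small).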
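The parameter list above corresponds to the glibc prototype int gethostbyname_r(const char *name, struct hostent *ret, char *buf, size_t buflen, struct hostent **result, int *h_errnop). Here is a minimal sketch of how a call might look. The helper name resolve_ipv4 and the starting buffer size of 1024 bytes are my own choices for illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* makes glibc headers declare gethostbyname_r */
#endif
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netdb.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: writes the first IPv4 address of `name` into `out`
 * as dotted-quad text. Returns 0 on success, -1 on any failure. */
int resolve_ipv4(const char *name, char *out, size_t outlen) {
    struct hostent ret, *result = NULL;
    size_t buflen = 1024;           /* arbitrary starting size */
    char *buf = malloc(buflen);
    int herr = 0, rc;

    assert(name != NULL && out != NULL);
    if (buf == NULL)
        return -1;

    /* ERANGE means the scratch buffer was too small: grow it and retry. */
    while ((rc = gethostbyname_r(name, &ret, buf, buflen,
                                 &result, &herr)) == ERANGE) {
        char *bigger;
        buflen *= 2;
        bigger = realloc(buf, buflen);
        if (bigger == NULL) {
            free(buf);
            return -1;
        }
        buf = bigger;
    }

    int ok = -1;
    /* Check both the return code and result, since result is NULL on error. */
    if (rc == 0 && result != NULL && result->h_addrtype == AF_INET &&
        result->h_addr_list[0] != NULL &&
        inet_ntop(AF_INET, result->h_addr_list[0], out,
                  (socklen_t)outlen) != NULL) {
        ok = 0;
    }
    free(buf);   /* result points into buf, so free only after copying out */
    return ok;
}
```

A caller could write char ip[INET_ADDRSTRLEN]; and then resolve_ipv4("www.baidu.com", ip, sizeof ip). Each thread uses its own ret and buf, which is why this variant can be called from several threads at once.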

struct hostent {
    char  *h_name;        // official name of host
    char **h_aliases;     // alias list
    int    h_addrtype;    // host address type: AF_INET or AF_INET6
    int    h_length;      // length of address
    char **h_addr_list;   // list of addresses
};
#define h_addr h_addr_list[0]   // for backward compatibility
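To make the struct a little more concrete, here is a small sketch that walks h_addr_list, which ends with a NULL pointer, and prints each address with inet_ntop. The function name print_addresses is hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: prints every address stored in `he` and returns
 * how many addresses were printed. */
int print_addresses(const struct hostent *he) {
    char text[INET6_ADDRSTRLEN];   /* large enough for IPv4 and IPv6 text */
    int count = 0;

    assert(he != NULL);
    /* h_addr_list is a NULL-terminated array of raw (binary) addresses. */
    for (char **p = he->h_addr_list; p != NULL && *p != NULL; ++p) {
        /* h_addrtype tells inet_ntop whether *p is an IPv4 or IPv6 address. */
        if (inet_ntop(he->h_addrtype, *p, text, sizeof text) != NULL) {
            printf("%s -> %s\n", he->h_name, text);
            ++count;
        }
    }
    return count;
}
```

The addresses are stored in network byte order as raw bytes, not as strings, which is why a conversion function such as inet_ntop is needed before printing.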

 

4. Method 2: write your own client to send the requests

 

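I. From the point of view of the DNS message
The thing to look at is the flags field in the DNS message header:
[QR][opcode][AA][TC][RD][RA][(zero)][rcode]

The field that matters here is TC. When TC is 1, the complete response is longer than 512 bytes and only the first 512 bytes were returned; the client then has to resend the original query over TCP. The reason is that a DNS message carried over UDP is limited to 512 bytes or less, whereas TCP splits the user's data stream into segments and can therefore carry a response longer than 512 bytes, or of any length, across several segments.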
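As an illustration of the flags layout above, the sketch below unpacks the 16-bit flags word (bytes 2 and 3 of the header, big-endian) and checks the TC bit. The bit widths are QR 1, opcode 4, AA 1, TC 1, RD 1, RA 1, zero 3 and rcode 4, as laid out in RFC 1035. The struct and function names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded view of the 16-bit DNS header flags word. */
struct dns_flags {
    bool    qr;      /* 0 = query, 1 = response          */
    uint8_t opcode;  /* 4 bits                           */
    bool    aa;      /* authoritative answer             */
    bool    tc;      /* truncated: retry over TCP        */
    bool    rd;      /* recursion desired                */
    bool    ra;      /* recursion available              */
    uint8_t z;       /* 3 reserved bits, must be zero    */
    uint8_t rcode;   /* 4-bit response code              */
};

/* Split the flags word (already in host byte order) into its fields. */
struct dns_flags dns_decode_flags(uint16_t flags) {
    struct dns_flags f;
    f.qr     = (flags >> 15) & 0x1;
    f.opcode = (flags >> 11) & 0xF;
    f.aa     = (flags >> 10) & 0x1;
    f.tc     = (flags >> 9)  & 0x1;
    f.rd     = (flags >> 8)  & 0x1;
    f.ra     = (flags >> 7)  & 0x1;
    f.z      = (flags >> 4)  & 0x7;
    f.rcode  = flags & 0xF;
    return f;
}

/* Returns true if the raw DNS message `msg` has the TC bit set.
 * The header is 12 bytes long; the flags live in bytes 2 and 3. */
bool dns_response_truncated(const uint8_t *msg, size_t len) {
    assert(msg != NULL);
    if (len < 12)
        return false;
    uint16_t flags = (uint16_t)((msg[2] << 8) | msg[3]);
    return dns_decode_flags(flags).tc;
}
```

A UDP client would call dns_response_truncated on each reply and, when it returns true, repeat the same query over a TCP connection.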
 
Most books only say that DNS uses UDP port 53. That is incomplete and leads people to believe that DNS uses only UDP and never TCP.
 
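II. From the point of view of the application
Zone transfers use TCP; ordinary queries use UDP.
What is a zone transfer?
The DNS specification defines two kinds of DNS server: the primary server and the secondary server. Within a zone, the primary server reads the zone's DNS data from data files on its own machine, while a secondary server reads the zone's DNS data from the zone's authoritative server. When a secondary server starts, it has to contact the primary server and load the data. This is called a zone transfer.
Put simply, transfers between DNS servers use TCP, while traffic between a client and a DNS server uses UDP.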
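As a starting point for a hand-written client of the kind this section describes, the sketch below builds a minimal query for an A record and, for the TCP case, adds the two-byte length prefix that RFC 1035 requires in front of each message sent over TCP. All function names are made up for this example, and sending the bytes (sendto for UDP, connect and write for TCP) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode "www.baidu.com" as the label sequence 3www5baidu3com0.
 * Returns the number of bytes written, or 0 on error. */
static size_t encode_qname(const char *host, uint8_t *out, size_t cap) {
    size_t n = 0;
    while (*host != '\0') {
        const char *dot = strchr(host, '.');
        size_t len = dot ? (size_t)(dot - host) : strlen(host);
        /* Labels are 1..63 bytes; keep room for the final zero byte. */
        if (len == 0 || len > 63 || n + 1 + len + 1 > cap)
            return 0;
        out[n++] = (uint8_t)len;
        memcpy(out + n, host, len);
        n += len;
        host += len;
        if (*host == '.')
            host++;
    }
    if (n + 1 > cap)
        return 0;
    out[n++] = 0;   /* root label terminates the name */
    return n;
}

/* Build a query for the A record of `host` with recursion desired.
 * Returns the message length, or 0 if the buffer is too small or the
 * name is malformed. */
size_t dns_build_query(uint16_t id, const char *host,
                       uint8_t *out, size_t cap) {
    assert(host != NULL && out != NULL);
    if (cap < 12 + 1 + 4)
        return 0;
    memset(out, 0, 12);            /* 12-byte header, all counts zero */
    out[0] = (uint8_t)(id >> 8);
    out[1] = (uint8_t)(id & 0xFF);
    out[2] = 0x01;                 /* flags high byte: RD = 1 */
    out[5] = 1;                    /* QDCOUNT = 1 */
    size_t q = encode_qname(host, out + 12, cap - 12 - 4);
    if (q == 0)
        return 0;
    size_t n = 12 + q;
    out[n++] = 0; out[n++] = 1;    /* QTYPE  = A  */
    out[n++] = 0; out[n++] = 1;    /* QCLASS = IN */
    return n;
}

/* Over TCP each DNS message is preceded by its length as a 16-bit
 * big-endian integer. Returns the framed length, or 0 on error. */
size_t dns_frame_for_tcp(const uint8_t *msg, size_t len,
                         uint8_t *out, size_t cap) {
    if (len > 0xFFFF || cap < len + 2)
        return 0;
    out[0] = (uint8_t)(len >> 8);
    out[1] = (uint8_t)(len & 0xFF);
    memcpy(out + 2, msg, len);
    return len + 2;
}
```

The same query bytes are used for both transports: the UDP client sends them as a single datagram, and the TCP fallback sends the framed version after connecting to port 53.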