Nutch Search Engine: The Distributed File System

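1. Introduction

NDFS stores very large, stream-oriented files across a set of machines, with storage redundancy and load balancing across those machines. Files are stored as blocks on separate machines in the NDFS, and a traditional input/output stream interface is provided for reading and writing them. Details such as locating blocks and transferring data over the network are handled automatically by NDFS and are transparent to the user. NDFS also copes well with changes to the set of storage machines: a machine can be added or removed easily. When a machine becomes unavailable, NDFS automatically preserves file availability. As long as the machines on the network provide enough storage space, the NDFS file system keeps operating normally. NDFS is built on ordinary disks and does not require RAID controllers or any other disk-array solution.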

2. Semantics

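1) A file can be written only once; once written, it becomes read-only (although it can still be deleted).
2) Files are stream-oriented: bytes can be appended only at the end of a file, and the read and write pointers can only move forward.
3) Files have no access control.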

Consequently, all access to NDFS goes through trusted client code, and no API is provided for other programs. Nutch is thus the model user of NDFS.
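The write-once, append-only rules above can be illustrated with a small in-memory model. This is only a sketch: `WriteOnceFile` and its methods are hypothetical names invented for illustration and are not part of the Nutch code base.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical in-memory model of the NDFS file semantics; not Nutch code.
class WriteOnceFile {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    private boolean closed = false;

    // Bytes can only be added at the end, and only before the file is closed.
    void append(byte[] bytes) {
        if (closed) {
            throw new IllegalStateException("file is read-only after close");
        }
        data.write(bytes, 0, bytes.length);
    }

    // Closing the file makes it permanently read-only.
    void close() {
        closed = true;
    }

    boolean isReadOnly() {
        return closed;
    }

    long length() {
        return data.size();
    }
}

class WriteOnceDemo {
    public static void main(String[] args) {
        WriteOnceFile f = new WriteOnceFile();
        f.append("hello ".getBytes());
        f.append("ndfs".getBytes());
        f.close();
        System.out.println("length = " + f.length());
        try {
            f.append("more".getBytes()); // rejected: the file was already closed
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

There is deliberately no method for overwriting or seeking backwards, which mirrors rule 2.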

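3. System design

NDFS contains two types of machines, NameNodes and DataNodes. NameNodes maintain the namespace, while DataNodes store the data blocks. An NDFS installation contains one NameNode and any number of DataNodes, each configured to communicate with that single NameNode.

1) NameNode: stores the entire namespace and the layout of the file system. It is a critical point and must not go down. It does little work, however, so it is not a load bottleneck. It maintains a table saved on disk: filename-0->BlockID_A,BlockID_B...BlockID_X,etc. A filename is a string, and a BlockID is a unique identifier. Each filename can have any number of blocks.

2) DataNode: stores the data. A block should be replicated on several DataNodes, and a DataNode holds at most one copy of any given block. It maintains a table: BlockID_X->array of bytes.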

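3) Cooperation: after starting up, a DataNode contacts the NameNode on its own initiative and reports the blocks it holds locally. From these reports the NameNode builds a tree describing how to find the blocks in NDFS, and it keeps this tree up to date in real time. Each DataNode periodically sends a message to the NameNode to prove that it is still alive; when the NameNode stops receiving these messages, it assumes the DataNode is down.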

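4) Reading and writing files: for example, when a client wants to read foo.txt, the following steps take place.
a. The client contacts the NameNode over the network and submits the filename "foo.txt".
b. The client receives a reply from the NameNode containing the blocks that make up "foo.txt" and, for each block, the list of DataNodes that hold it.
c. The client reads the blocks in order. For each block, it picks a suitable DataNode from that block's list and sends it a request; the DataNode then transfers the data to the client.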
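The two tables and the read steps a to c can be sketched in memory as follows. All class names here (`SketchNameNode`, `SketchDataNode`, `SketchClient`) are hypothetical stand-ins, network calls are replaced by direct method calls, and picking the first DataNode in the list is an assumed simplification rather than the selection policy NDFS actually uses.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a DataNode: maps block ID -> bytes.
class SketchDataNode {
    final String name;
    final Map<String, byte[]> blocks = new HashMap<>();

    SketchDataNode(String name) { this.name = name; }

    byte[] readBlock(String blockId) { return blocks.get(blockId); }
}

// Hypothetical stand-in for the NameNode: maps filename -> ordered block IDs,
// and block ID -> DataNodes that reported holding a copy.
class SketchNameNode {
    final Map<String, List<String>> fileToBlocks = new HashMap<>();
    final Map<String, List<SketchDataNode>> blockToNodes = new HashMap<>();

    // Called when a DataNode reports its local blocks after startup.
    void reportBlocks(SketchDataNode node) {
        for (String blockId : node.blocks.keySet()) {
            blockToNodes.computeIfAbsent(blockId, k -> new ArrayList<>()).add(node);
        }
    }
}

class SketchClient {
    // Steps a-c: look up the block list, then fetch each block in order
    // from one of the DataNodes that holds it.
    static byte[] read(SketchNameNode nameNode, String filename) {
        List<String> blockIds = nameNode.fileToBlocks.get(filename);
        if (blockIds == null) {
            throw new IllegalArgumentException("no such file: " + filename);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String blockId : blockIds) {
            List<SketchDataNode> holders = nameNode.blockToNodes.get(blockId);
            if (holders == null || holders.isEmpty()) {
                throw new IllegalStateException("no DataNode holds " + blockId);
            }
            // Assumed policy: take the first holder in the list.
            byte[] bytes = holders.get(0).readBlock(blockId);
            out.write(bytes, 0, bytes.length);
        }
        return out.toByteArray();
    }
}

class ReadPathDemo {
    public static void main(String[] args) {
        SketchDataNode b = new SketchDataNode("machineB");
        SketchDataNode c = new SketchDataNode("machineC");
        b.blocks.put("BlockID_A", "hello ".getBytes());
        c.blocks.put("BlockID_A", "hello ".getBytes());
        c.blocks.put("BlockID_B", "world".getBytes());

        SketchNameNode nn = new SketchNameNode();
        nn.fileToBlocks.put("foo.txt", Arrays.asList("BlockID_A", "BlockID_B"));
        nn.reportBlocks(b);
        nn.reportBlocks(c);

        System.out.println(new String(SketchClient.read(nn, "foo.txt")));
    }
}
```

Note that the NameNode never touches block contents in this sketch; it only answers the lookup, which is consistent with the claim that it is not a load bottleneck.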

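4. System availability

The availability of NDFS depends on block redundancy, that is, on how many DataNodes keep a copy of the same block. Where resources allow, it can be set to 3 desired copies and a minimum of 2 (the DESIRED_REPLICATION and MIN_REPLICATION constants in fs.FSNamesystem). When the number of copies of a block falls below MIN_REPLICATION, the NameNode instructs DataNodes to make new copies.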
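One way to picture how missed heartbeats lead to re-replication is the bookkeeping sketch below. The class `ReplicationMonitor`, its method names, and the 10-second timeout are assumptions made for illustration; only the MIN_REPLICATION value of 2 comes from the text above, and the real logic lives in fs.FSNamesystem.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of NameNode bookkeeping; not the FSNamesystem code.
class ReplicationMonitor {
    static final int MIN_REPLICATION = 2;            // value quoted in the text
    static final long HEARTBEAT_TIMEOUT_MS = 10_000; // assumed for this sketch

    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final Map<String, Set<String>> blockHolders = new HashMap<>();

    // A DataNode reports that it is alive and which blocks it holds.
    void heartbeat(String dataNode, Set<String> blockIds, long nowMs) {
        lastHeartbeat.put(dataNode, nowMs);
        for (String blockId : blockIds) {
            blockHolders.computeIfAbsent(blockId, k -> new HashSet<>()).add(dataNode);
        }
    }

    // A DataNode counts as alive if its last heartbeat is recent enough.
    boolean isAlive(String dataNode, long nowMs) {
        Long last = lastHeartbeat.get(dataNode);
        return last != null && nowMs - last <= HEARTBEAT_TIMEOUT_MS;
    }

    // Blocks whose number of live holders has dropped below MIN_REPLICATION.
    List<String> underReplicated(long nowMs) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : blockHolders.entrySet()) {
            int live = 0;
            for (String node : e.getValue()) {
                if (isAlive(node, nowMs)) {
                    live++;
                }
            }
            if (live < MIN_REPLICATION) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}

class ReplicationDemo {
    public static void main(String[] args) {
        ReplicationMonitor m = new ReplicationMonitor();
        Set<String> blocks = Collections.singleton("BlockID_A");
        m.heartbeat("machineB", blocks, 0L);
        m.heartbeat("machineC", blocks, 0L);
        m.heartbeat("machineC", blocks, 8_000L); // machineB stays silent
        System.out.println(m.underReplicated(12_000L));
    }
}
```

In a fuller version, each block returned by `underReplicated` would be turned into an instruction for a live DataNode to copy the block elsewhere.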

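5. Files in the net.nutch.fs package

1) NDFS.java: contains two main functions, one for the NameNode and one for the DataNode.
2) FSNamesystem.java: maintains the namespace and implements the NameNode functionality, such as how to locate blocks and the list of available DataNodes.
3) FSDirectory.java: called by FSNamesystem to maintain the state of the namespace. It logs all NameNode state and changes, so that the NameNode can be recovered from this log after a crash.
4) FSDataset.java: used by the DataNode to maintain its list of blocks, among other things.
5) Block.java and DatanodeInfo: used to maintain block information.
6) FSResults.java and FSParam.java: used to pass parameters and results over the network.
7) FSConstants.java: contains constants used for parameter tuning.
8) NDFSClient.java: used to read and write data.
9) TestClient.java: contains a main function that provides commands for accessing NDFS.

6. A simple example

1) Create the NameNode:
Machine A: java net.nutch.fs.NDFS$NameNode 9000 namedir

2) Create the DataNodes:
Machine B: java net.nutch.fs.NDFS$DataNode datadir1 machineB 8000 machineA:9000
Machine C: java net.nutch.fs.NDFS$DataNode datadir2 machineC 8000 machineA:9000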

After steps 1 and 2 you have an NDFS consisting of one NameNode and two DataNodes. (NDFS can also be set up in different directories on a single machine.)

3) File access from the client:
Create a file: java net.nutch.fs.TestClient machineA:9000 CREATE foo.txt
Read the file: java net.nutch.fs.TestClient machineA:9000 GET foo.txt
Rename the file: java net.nutch.fs.TestClient machineA:9000 RENAME foo.txt bar.txt
Read the file again: java net.nutch.fs.TestClient machineA:9000 GET bar.txt
Delete the file: java net.nutch.fs.TestClient machineA:9000 DELETE bar.txt


= IPC =

IPC, for InterProcess Communication, is a fast and easy RPC mechanism. Unlike Sun's standard RPC package, it does not use standard Java serialization and instead requires the author of every class to write the relevant serialization method. That extra work might seem like a drawback, but if you've ever tried to debug Sun's class versioning system, you realize that the extra work is in fact welcome.

IPC does not require a special compiler of any kind to create network stubs and skeletons. Rather, it uses introspection to examine a declared "publicly-available interface" and determine how to marshal and unmarshal arguments.
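To make the "no stub compiler" idea concrete, here is a minimal sketch using the JDK's `java.lang.reflect.Proxy`. It is not the IPC implementation: `EchoProtocol`, `EchoServer`, and `StublessCall` are hypothetical names, and where a real IPC layer would write the method name and arguments to a socket, this sketch simply dispatches to a local object by reflection.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Arrays;

// Hypothetical protocol interface; not an actual Hadoop or Nutch interface.
interface EchoProtocol {
    String echo(String message);
    int add(int a, int b);
}

// Hypothetical server-side object that receives the calls.
class EchoServer implements EchoProtocol {
    public String echo(String message) { return message; }
    public int add(int a, int b) { return a + b; }
}

class StublessCall {
    // Builds a client-side object for the interface at runtime, with no
    // generated stub class. The handler sees every call as a Method plus an
    // argument array, which is the information a wire protocol would need.
    static <T> T proxyFor(Class<T> iface, Object target) {
        InvocationHandler handler = (proxy, method, callArgs) -> {
            // Stand-in for the network hop: find the method by name and
            // parameter types on the target, then invoke it.
            Method m = target.getClass()
                    .getMethod(method.getName(), method.getParameterTypes());
            return m.invoke(target, callArgs);
        };
        return iface.cast(Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, handler));
    }

    public static void main(String[] args) {
        // Introspection: list what the declared interface exposes.
        for (Method m : EchoProtocol.class.getMethods()) {
            System.out.println(m.getName() + " " + Arrays.toString(m.getParameterTypes()));
        }
        EchoProtocol client = proxyFor(EchoProtocol.class, new EchoServer());
        System.out.println(client.echo("hello"));
        System.out.println(client.add(2, 3));
    }
}
```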

IPC is used as the internal procedure call mechanism for all of Hadoop and Nutch.

Use Model

IPC is a client/server system. A server process offers service to others by opening a socket and exposing one or more Java interfaces that remote callers can invoke. User server code must indicate the port number and an instance of an object that will receive remote calls. (see RPC.getServer())

A client contacts a server at a specified host and port, and invokes methods exposed by the server. User client code must indicate the target hostname and port, and also the name of the Java interface that the client would like to invoke. While a single IPC server object can expose several interfaces simultaneously, a client can invoke only one of them at a time. (see RPC.getClient())

There is no way for an IPC server to invoke methods of the client. There are places in Hadoop where bidirectional communication is helpful (e.g., in DFS, where the Name and Data nodes must report status to each other). In these cases, one side acts as a client, making the same call over and over again. The server always returns a special "status" object, which the client may then interpret as a request to perform work.
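The polling pattern described above might look roughly like the following. Every name here (`StatusReply`, `PollingServer`, `PollingClient`, the command string) is hypothetical and chosen only to illustrate the idea of a reply that doubles as a work request; the remote call is modeled as a plain method call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical status object returned from every report call.
class StatusReply {
    final String command; // null means "nothing to do"

    StatusReply(String command) { this.command = command; }
}

// Server side: it cannot call the client, so it queues work and hands it out
// as the return value of the client's next report.
class PollingServer {
    private final Queue<String> pending = new ArrayDeque<>();

    void enqueue(String command) { pending.add(command); }

    StatusReply report(String clientName) {
        return new StatusReply(pending.poll()); // poll() returns null when empty
    }
}

// Client side: makes the same call repeatedly and interprets each reply.
class PollingClient {
    final List<String> performed = new ArrayList<>();

    // One round of the loop: report, then treat the reply as a work request.
    void reportOnce(PollingServer server) {
        StatusReply reply = server.report("datanode-1");
        if (reply.command != null) {
            performed.add(reply.command);
        }
    }
}

class PollingDemo {
    public static void main(String[] args) {
        PollingServer server = new PollingServer();
        PollingClient client = new PollingClient();
        server.enqueue("REPLICATE BlockID_A");
        client.reportOnce(server); // picks up the queued command
        client.reportOnce(server); // nothing pending this time
        System.out.println(client.performed);
    }
}
```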

Under the covers

The IPC mechanism automatically inspects the client's requested interface, plus the server's exposed interfaces, and figures out how to marshal and unmarshal arguments for the remote call. This system works fine as long as every method argument is one of Java's built-in types, a String, an implementation of the Writable interface, or an array of one of those types.
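As an illustration of what hand-written serialization involves, the sketch below declares a local stand-in interface, `SketchWritable`, with the same two-method shape as Hadoop's Writable (`write(DataOutput)` and `readFields(DataInput)`), so that it compiles without Hadoop on the classpath. `BlockRef` is a hypothetical argument type invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in mirroring the shape of Hadoop's Writable interface.
interface SketchWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical argument type: a block ID plus its length.
class BlockRef implements SketchWritable {
    long id;
    int length;

    BlockRef() {} // no-arg constructor so a receiver can create an empty instance

    BlockRef(long id, int length) {
        this.id = id;
        this.length = length;
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeInt(length);
    }

    public void readFields(DataInput in) throws IOException {
        // Fields must be read in exactly the order they were written.
        id = in.readLong();
        length = in.readInt();
    }
}

class WritableRoundTrip {
    // Serializes the object to bytes and reads it back into a fresh instance.
    static BlockRef roundTrip(BlockRef original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            BlockRef copy = new BlockRef();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BlockRef copy = roundTrip(new BlockRef(42L, 1024));
        System.out.println(copy.id + " " + copy.length);
    }
}
```

Because the class author controls both methods, the byte layout is explicit and does not depend on Java's class-versioning rules.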

iptvspace
