MooseFS (MFS) distributed file system

The definitive MFS (MooseFS) guide: a one-stop distributed file system solution (deployment, performance testing), continuously updated

http://bbs.chinaunix.net/thread-1644309-1-1.html


1. I ran into some problems during the performance tests. My time is limited, so I hope everyone will test and solve them together; please point out any issues promptly, because I am still feeling my way forward as well.
2. I will not introduce MFS itself here; for details see the MFS hands-on article in this forum,
http://bbs.chinaunix.net/thread-1643863-1-1.html , or search Baidu/Google for the keyword 田逸.
3. I hope readers can contribute better storage/file-system test models so we can improve this document together (test scripts and test cases are warmly welcomed).
4. Please share real production cases: configurations, environments, scripts, monitoring mechanisms, and so on.
5. I hope readers familiar with the code will look into how MFS is implemented internally.
6. Special thanks to 田逸 for his document:
http://sery.blog.51cto.com/10037/263515 .
7. Special thanks to QQ group members tt, 靈犀, 流雲風 and hzqbbc for sharing their valuable experience with everyone in the group.

8. Special thanks to the storage expert 冬瓜頭, author of 《大話存儲》, for his guidance while I was running the performance tests.
9. Special thanks to QQ group member 高性能架構 (CU ID: leo_ss_pku) for producing a more professional and polished PDF version: MooseFS權威指南.pdf (3.32 MB, 3388 downloads); you can also read the online version on his blog: http://www.himysql.com/doc/mfs.html
     
MFS advantages:
-1. Free (GPL).
0. A general-purpose file system: upper-layer applications can use it without modification (DFSs that require a dedicated API are such a pain!).
1. Online capacity expansion; the architecture scales extremely well (official cases have grown to 70 nodes!).
2. Simple to deploy (sysadmins are delighted, managers are happy!).
3. Highly available architecture, with no single point of failure among the components (what are you waiting for?).
4. Highly available file objects: any degree of file redundancy can be configured (redundancy levels beyond RAID 1+0) without hurting read or write performance; it only gets faster!
5. A Windows-style recycle bin (no more fear of fat-fingered deletes; it gives you instant-rollback features similar to Oracle Flashback and other high-end DBMS capabilities, which Oracle charges for!).
6. Garbage collection similar to the Java GC.
7. Snapshot features like those of commercial storage from NetApp, EMC, IBM and others.
8. A C implementation of the Google File System idea (Google has cleared the path!).
9. A web GUI monitoring interface.
10. Better random read/write efficiency (still to be proven).
11. Better read/write efficiency for huge numbers of small files (still to be proven).
Possible bottlenecks:
0. The performance of the master itself (a rough analogy: as in MySQL master-slave replication, the slaves are easy to scale out, the master is not). (QQ group member: hzqbbc)
         Short-term workaround: partition by business line.
1. A foreseeable upper limit on the total number of files the architecture can hold.
       (MFS caches the whole file system structure in the master's RAM; in my view, the more files there are, the more master memory is consumed: roughly 8 GB for 25 million files, so 200 million files would need about 64 GB of RAM; see the quick arithmetic check after this list.) (QQ group member: hzqbbc)
         Short-term workaround: partition by business line.
2. The robustness of the single-point-of-failure (master failover) solution. (QQ group members: tt, hzqbbc)
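
A back-of-the-envelope check of the RAM estimate in item 1, assuming the reported ratio of about 8 GB of master RAM per 25 million files scales linearly:

# ~8 GiB / 25,000,000 files ≈ 340 bytes of master RAM per file
echo $(( 200000000 * 8 / 25000000 ))   # => 64, i.e. roughly 64 GiB for 200 million files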


Architecture diagrams
read862.png 

write862.png 

——————————————————
index
1. mfs master
2. mfs chunkserver
3. mfs client

4. System administration
5. Performance testing

6. References
6.1 Literature
6.2 Test data
                Test model 1
                Test model 2
7. Thanks
8. Appendix
9. Real-world operational cases
10. Production deployments
11. Web GUI monitoring
12. Official introduction to the 1.6.x release (Chinese translation: QQ group member Cuatre)
13. Official MooseFS FAQ in English (provided by QQ group member 靈犀)

14. mfs master hot-standby solution
15. mfs Nagios monitoring script (provided by QQ group member 流雲風)
————————————————
Environment
master          1 server
chunkserver     3 servers
client          1 server
OS:
CentOS 5.3 x64
1. mfs master
1.1 Install the mfs master
wget http://ncu.dl.sourceforge.net/project/moosefs/moosefs/1.6.11/mfs-1.6.11.tar.gz
tar zxvf mfs-1.6.11.tar.gz 
cd mfs-1.6.11
useradd mfs -s /sbin/nologin
./configure --prefix=/usr/local/mfs --with-default-user=mfs --with-default-group=mfs
make
make install
cd /usr/local/mfs/etc/
cp mfsmaster.cfg.dist mfsmaster.cfg
cp mfsexports.cfg.dist mfsexports.cfg
vim mfsmaster.cfg
vim mfsexports.cfg
cd ../var/mfs/
cp metadata.mfs.empty metadata.mfs
cat metadata.mfs
/usr/local/mfs/sbin/mfsmaster start 
ps axu | grep mfsmaster
lsof -i
tail -f /var/log/messages 

1.2 Start the master service
/usr/local/mfs/sbin/mfsmaster start 
working directory: /usr/local/mfs/var/mfs
lockfile created and locked
initializing mfsmaster modules ...
loading sessions ... ok
sessions file has been loaded
exports file has been loaded
loading metadata ...
create new empty filesystem
metadata file has been loaded
no charts data file - initializing empty charts
master <-> metaloggers module: listen on *:9419
master <-> chunkservers module: listen on *:9420
main master server module: listen on *:9421
mfsmaster daemon initialized properly

1.3 Stop the master service
/usr/local/mfs/sbin/mfsmaster -s

1.4 Start and stop the web GUI
Start: /usr/local/mfs/sbin/mfscgiserv
Stop: kill the mfscgiserv process, e.g. kill $(pgrep -f mfscgiserv)

1.5 Related configuration files
vim  mfsexports.cfg
192.168.28.0/24  . rw
192.168.28.0/24  /       rw
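
Each line in mfsexports.cfg has the form "client-address  directory  options". As far as I understand the 1.6.x format, "." exports the meta filesystem (needed for the mfsmount -m mount in section 3.3) and "/" exports the regular file system root; a commented sketch of the two lines above:

# <clients>           <directory>   <options>
# 192.168.28.0/24     .             rw        # meta filesystem (trash/reserved access)
# 192.168.28.0/24     /             rw        # whole MooseFS tree, read-write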

2. mfs chunkserver
2.1 Create a local file system on a block device
fdisk -l
mkfs.ext3 /dev/sdb 
mkdir /data

chown mfs:mfs /data
mount -t ext3 /dev/sdb /data

df -ah
/dev/sdb              133G  188M  126G   1% /data

2.2 Create a 50 GB loop-device file
df -ah
dd if=/dev/zero of=/opt/mfs.img bs=1M count=50000
losetup /dev/loop0 /opt/mfs.img
mkfs.ext3 /dev/loop0
mkdir /data
chown mfs:mfs /data
mount /dev/loop0 /data
df -ah



2.3 Install the chunkserver
wget http://ncu.dl.sourceforge.net/project/moosefs/moosefs/1.6.11/mfs-1.6.11.tar.gz
tar zxvf mfs-1.6.11.tar.gz 
cd mfs-1.6.11
useradd mfs -s /sbin/nologin
./configure --prefix=/usr/local/mfs --with-default-user=mfs --with-default-group=mfs
make
make install
cd /usr/local/mfs/etc/
cp mfschunkserver.cfg.dist mfschunkserver.cfg
cp mfshdd.cfg.dist mfshdd.cfg
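
mfshdd.cfg must list the directories the chunkserver may use for chunk storage, one path per line; for the /data mount prepared in 2.1/2.2 that would be (sketch):

echo "/data" >> /usr/local/mfs/etc/mfshdd.cfg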


2.4 Start the chunkserver
/usr/local/mfs/sbin/mfschunkserver start
ps axu |grep mfs
tail -f /var/log/messages 

2.5 Stop the chunkserver
/usr/local/mfs/sbin/mfschunkserver stop


3. mfs client
3.1 Install FUSE
yum install kernel.x86_64 kernel-devel.x86_64 kernel-headers.x86_64
###reboot server####
yum install fuse.x86_64 fuse-devel.x86_64 fuse-libs.x86_64
modprobe fuse
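
An optional sanity check that the fuse module is actually loaded before building mfsmount:

lsmod | grep fuse
ls -l /dev/fuse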


3.2 Install the mfs client
wget http://ncu.dl.sourceforge.net/project/moosefs/moosefs/1.6.11/mfs-1.6.11.tar.gz
tar zxvf mfs-1.6.11.tar.gz 
cd mfs-1.6.11
useradd mfs -s /sbin/nologin
./configure --prefix=/usr/local/mfs --with-default-user=mfs --with-default-group=mfs --enable-mfsmount
make 
make install

3.3 Mount the file system
cd /mnt/
mkdir mfs
/usr/local/mfs/bin/mfsmount /mnt/mfs/ -H 192.168.28.242

mkdir mfsmeta
/usr/local/mfs/bin/mfsmount -m /mnt/mfsmeta/ -H 192.168.28.242

df -ah
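
To unmount, a plain umount /mnt/mfs is enough; if the mount should come back after a reboot, one simple (if crude) option is an rc.local entry, a sketch assuming the paths and master address used above:

echo "/usr/local/mfs/bin/mfsmount /mnt/mfs -H 192.168.28.242" >> /etc/rc.local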



4. System administration

4.1 Management commands

Set the number of copies (goal) for files; 3 copies are recommended:
/usr/local/mfs/bin/mfssetgoal -r 3 /mnt/mfs

Query the goal of a file or directory:
/usr/local/mfs/bin/mfsgetgoal  /mnt/mfs

Show directory information (space usage and object counts):
/usr/local/mfs/bin/mfsdirinfo -H /mnt/mfs
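
Trash retention can be inspected and tuned with the same family of tools (they come up again in case 9.1); a sketch:

/usr/local/mfs/bin/mfsgettrashtime /mnt/mfs           # show the current trash time (in seconds)
/usr/local/mfs/bin/mfssettrashtime -r 300 /mnt/mfs    # recursively set it to 300 seconds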



5. Performance testing

5.1 MFS

1. Large file (block = 1 MB)
dd if=/dev/zero of=1.img bs=1M count=5000
5242880000 bytes (5.2 GB) copied, 48.8481 seconds, 107 MB/s

2. Small files (50 bytes x 1,000,000 files x 1 client) (1000 * 1000 directories), write
real    83m41.343s
user    4m17.993s
sys    16m58.939s

List
time find ./ -type f | nl | tail
999999  ./0/1
1000000 ./0/0
real    0m39.418s
user    0m0.721s
sys    0m0.225s

Delete
time rm -fr *
real    6m35.273s
user    0m0.394s
sys    0m23.546s

3. Small files (1 KB x 1,000,000 files x 100 clients) (1000 * 1000 directories)
Write (100 clients)
time ../../p_touch_file.sh
real    22m51.159s
user    4m42.850s
sys    18m41.437s

List (1 client)
time find ./ | nl | tail 
real    0m35.910s
user    0m0.628s
sys    0m0.204s

Delete (1 client)
time rm -fr *
real    6m36.530s
user    0m0.697s
sys    0m21.682s


4. Small files (1 KB x 1,000,000 files x 200 clients) (1000 * 1000 directories)
time ../../p_touch_file.sh
real    27m56.656s
user    5m12.195s
sys    20m52.079s



5. Small files (1 KB x 1,000,000 files x 1000 clients) (1000 * 1000 directories)
Write
time ../../p_touch_file.sh
real    30m30.336s
user    5m6.607s
sys    21m





5.2 Local disk
1. Large file (block = 1 MB)
dd if=/dev/zero of=1.img bs=1M count=5000
5242880000 bytes (5.2 GB) copied, 58.7371 seconds, 89.3 MB/s


2. Small files (50 bytes x 1,000,000 files x 1 client) (1000 * 1000 directories)
Write
time ../touch_file.sh 
real  17m47.746s
user 4m54.068s
sys  12m54.425s

List
time find ./ -type f | nl | tail
1000000 ./875/582
1000001 ./875/875
real 0m9.120s
user 0m1.102s
sys 0m0.726s

Delete
time rm -fr *
real 0m37.201s
user 0m0.432s
sys 0m15.268s


5.3 Benchmarks (first run)
5.3.1 Random read

random_read_performance.jpg 

5.3.2 Random write
random_wirte_performance.jpg 

5.3.3 Sequential read
read_performance.jpg 
5.3.4 Sequential write
write_performance.jpg 


5.4 Benchmarks (second run)
5.4.1 Random read
2_random_write_performance.jpg 

(Continued 1)

6. References
6.1 Literature
http://sery.blog.51cto.com/10037/263515  田逸
http://bbs.chinaunix.net/thread-1643863-1-1.html  ltgzs777
http://www.moosefs.org/  official site
http://bbs.chinaunix.net/thread-1643015-1-2.html   testing tools


6.2 Test data

Performance test model 1
Test results from a fellow whose name I do not know; I am posting them first - if you see this, please message me.


Small-file performance test

Two-level 100*100 directory tree:
* Single 15k.5 disk, ext3, single client process
  Create: real 0m0.762s   user 0m0.048s   sys 0m0.261s
  List:   real 0m0.179s   user 0m0.036s   sys 0m0.125s
  Delete: real 0m0.492s   user 0m0.036s   sys 0m0.456s
* Single 15k.5 disk, ext3, 10 concurrent client processes (longest time)
  Create: real 0m0.724s   user 0m0.015s   sys 0m0.123s
  List:   real 0m0.057s   user 0m0.006s   sys 0m0.025s
  Delete: real 0m0.226s   user 0m0.010s   sys 0m0.070s
* 6 chunkservers, cache, single client process
  Create: real 0m2.084s   user 0m0.036s   sys 0m0.252s
  List:   real 0m4.964s   user 0m0.043s   sys 0m0.615s
  Delete: real 0m6.661s   user 0m0.046s   sys 0m0.868s
* 6 chunkservers, cache, 10 concurrent client processes (longest time)
  Create: real 0m1.422s   user 0m0.007s   sys 0m0.050s
  List:   real 0m2.022s   user 0m0.008s   sys 0m0.108s
  Delete: real 0m2.318s   user 0m0.008s   sys 0m0.136s

Two-level 1000*1000 directory tree:
* Single 15k.5 disk, ext3, single client process
  Create: real 11m37.531s  user 0m4.363s   sys 0m37.362s
  List:   real 39m56.940s  user 0m9.277s   sys 0m48.261s
  Delete: real 41m57.803s  user 0m10.453s  sys 3m11.808s
* Single 15k.5 disk, ext3, 10 concurrent client processes (longest time)
  Create: real 11m7.703s   user 0m0.519s   sys 0m10.616s
  List:   real 39m30.678s  user 0m1.031s   sys 0m4.962s
  Delete: real 40m23.018s  user 0m1.043s   sys 0m19.618s
* 6 chunkservers, cache, single client process
  Create: real 3m17.913s   user 0m3.268s   sys 0m30.192s
  List:   real 11m56.645s  user 0m3.810s   sys 1m10.387s
  Delete: real 12m14.900s  user 0m3.799s   sys 1m26.632s
* 6 chunkservers, cache, 10 concurrent client processes (longest time)
  Create: real 1m13.666s   user 0m0.328s   sys 0m3.295s
  List:   real 4m31.761s   user 0m0.531s   sys 0m10.235s
  Delete: real 4m26.962s   user 0m0.663s   sys 0m13.117s

Three-level 100*100*100 directory tree:
* Single 15k.5 disk, ext3, single client process
  Create: real 9m51.331s   user 0m4.036s   sys 0m32.597s
  List:   real 27m24.615s  user 0m8.907s   sys 0m44.240s
  Delete: real 28m17.194s  user 0m10.644s  sys 1m34.998s
* Single 15k.5 disk, ext3, 10 client processes (longest time)
  Create: real 10m22.170s  user 0m0.580s   sys 0m11.720s
  List:   real 33m32.386s  user 0m1.127s   sys 0m5.280s
  Delete: real 33m7.808s   user 0m1.196s   sys 0m10.588s
* 6 chunkservers, cache, single client process
  Create: real 3m21.720s   user 0m3.089s   sys 0m26.635s
  List:   real 9m26.535s   user 0m3.901s   sys 1m11.756s
  Delete: real 10m51.558s  user 0m4.186s   sys 1m26.322s
* 6 chunkservers, cache, 10 concurrent client processes (longest time)
  Create: real 1m23.023s   user 0m0.429s   sys 0m3.869s
  List:   real 4m10.617s   user 0m0.643s   sys 0m11.588s
  Delete: real 4m20.137s   user 0m0.649s   sys 0m14.120s
* 6 chunkservers, cache, 50 concurrent client processes (longest time)
  Create: real 1m26.388s   user 0m0.074s   sys 0m0.679s
  List:   real 4m37.102s   user 0m0.132s   sys 0m2.160s
  Delete: real 4m37.392s   user 0m0.132s   sys 0m2.755s
* 6 chunkservers, cache, 100 concurrent client processes (longest time)
  Create: real 1m29.338s   user 0m0.062s   sys 0m0.363s
  List:   real 4m54.925s   user 0m0.069s   sys 0m1.212s
  Delete: real 4m35.845s   user 0m0.068s   sys 0m1.640s
* 6 chunkservers, cache, remote client, 10 concurrent processes (longest time)
  Create: real 4m0.411s    user 0m2.985s   sys 0m12.287s
  List:   real 8m31.351s   user 0m4.223s   sys 0m29.800s
  Delete: real 4m3.271s    user 0m3.206s   sys 0m11.922s

Three-level 100*100*100 directory tree, five consecutive runs:
* Changelog/metadata size: run 1 ~55 MB; runs 2-5 ~60 MB each
* Consecutive create times:
  Run 1: real 4m0.411s    user 0m2.985s   sys 0m12.287s
  Run 2: real 4m12.309s   user 0m3.039s   sys 0m12.899s
  Run 3: real 4m14.010s   user 0m3.418s   sys 0m12.831s
  Run 4: real 4m14.214s   user 0m3.247s   sys 0m12.871s
  Run 5: real 4m14.417s   user 0m3.170s   sys 0m12.948s

Notes:
* On a single disk, adding processes does not improve performance: the workload is stuck in I/O wait, and more processes can even burn a lot of time in scheduling.
* With MFS, performance does improve with multiple processes, and the main cost shifts to CPU system time; in practice, therefore, massive small-file performance is far better than on the local file system.



Performance test model 2 (thanks to QQ group member 痞子白)
Two clients running dd at the same time
Block size 1 MB, file size 20 GB
Client1  write: 68.4 MB/s   read: 25.3 MB/s
Client2  write: 67.5 MB/s   read: 24.7 MB/s
Aggregate throughput: write 135.9 MB/s, read 50.0 MB/s

Write command: dd if=/dev/zero of=/mfs/test.1 bs=1M count=20000
Read command:  dd if=/mfs/test.1 of=/dev/null bs=1M
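
One caveat worth adding (my note, not part of the original test): the read pass is served partly from the client page cache unless the cache is dropped between the write and the read, e.g.:

sync && echo 3 > /proc/sys/vm/drop_caches    # run on the client before the read pass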


7. Thanks
田逸
The fellow whose name I do not know (please contact me if you see this)



8. Appendix
8.1 Script for 1000 * 1000 * 1 client
#!/bin/bash
for ((i=0;i<1000;i++))
do
    mkdir ${i}
    cd ${i}
    for ((j=0;j<1000;j++))
      do
        cp /mnt/test ${j}
      done
      cd ..
done
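
After a run, the file count can be checked against the expected 1,000,000 with the same find pipeline used in section 5 (sketch):

find . -type f | wc -l    # expect 1000000
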
8.2 Script for 1000 * 1000 * (100, 200, 1000 clients)
#!/bin/bash
declare -f make_1000_dir_file
cd `pwd`
function make_1000_dir_file {
    start=${1}
    stop=${2}
    for ((i=${start};i<${stop};i++))
    do
        mkdir ${i}
        for ((j=0;j<1000;j++))
        do
            cp /mnt/test ${i}/${j}
        done
    done
}
i=1
while [ ${i} -le 1000 ]
do 
    ((n=${i}+1))
    make_1000_dir_file ${i} ${n} &
    ((i=${i}+1))
done 
wait
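
A usage note (my reading of the script, not spelled out in the original): the while-loop bound of 1000 is the number of concurrent writer processes, so change it to 100 or 200 for the smaller runs, and start the script from a working directory on the MFS mount, e.g.:

cd /mnt/mfs/test               # directory name assumed; any directory on the MFS mount works
time ../../p_touch_file.sh     # invoked as in section 5; adjust the relative path to where the script lives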


(Continued 2)

9. Real-world operational cases
9.1 The default trash (garbage-collection) time is 86400 seconds, so it is possible for storage capacity to blow up before the trash has been emptied. (Case provided by shinelian.)

Option 1: lower the trash time and monitor storage capacity aggressively.
           In testing, setting the trash time to 300 seconds reclaimed the space correctly.

Option 2: periodically delete the files under the trash directory of the mfsmeta mount by hand (robustness still to be verified; the space is reclaimed after deletion, but I do not know whether there are side effects). See the sketch after this item.
           In testing there appear to be no side effects; if you run into any, please contact me in the QQ group.
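
A minimal sketch of the two options above, assuming the meta filesystem is mounted at /mnt/mfsmeta as in section 3.3 (trash directory name as in the 1.6.x meta mount layout):

/usr/local/mfs/bin/mfssettrashtime -r 300 /mnt/mfs    # option 1: shorten trash retention to 300 seconds
rm -rf /mnt/mfsmeta/trash/*                           # option 2: manual purge; deleted files can no longer be undeleted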


9.2 I read the MFS 1.6.x User Guides and FAQ, discussed the parts I did not understand with 靈犀, and we reconciled the points where our readings differed. Compared with 1.5.x, MFS 1.6.x changes in the following ways (special thanks to QQ group members 流雲風 and 靈犀):
     (1) Fixes the 1.5.x bug of too many open files during large batch operations. We hit this during our tests as well: the error reported too many open files and caused connection failures on the chunkservers. We later tried repeatedly to reproduce the problem but never could. Solving it in 1.6.x removes a major issue.

      (2) Adds the metalogger server. This did not exist in 1.5.x; it provides redundancy for the master and further strengthens its stability. In the MFS architecture the master has the highest stability and performance requirements, so keeping the master stable is essential.

      (3) Improves the handling of bad chunks. In 1.5.x, chunk checksum errors on a chunkserver would quite often lead the master to automatically kick the affected chunkserver out. 1.6.x adds a repair function for bad chunks, making repair convenient and simplifying how bad chunks are dealt with.

      (4) A new understanding of metadata and the changelog. We used to think the changelog recorded file operations and was periodically rolled into the metadata, like a database log being archived. That understanding was off: the changelog does record operations on files, while the metadata records file sizes and locations. The metadata is therefore the more important piece, and recovery is performed from the metadata plus the most recent changelog.

      (5) The MFS documentation now states the memory and disk requirements explicitly: "In our environment (ca. 500 TiB, 25 million files, 2 million folders distributed on 26 million chunks on 70 machines) the usage of chunkserver CPU (by constant file transfer) is about 15-20% and chunkserver RAM usually consumes about 100MiB (independent of amount of data). The master server consumes about 30% of CPU (ca. 1500 operations per second) and 8GiB RAM. CPU load depends on amount of operations and RAM on number of files and folders."

      (6) It points out that in testing, having more chunkservers does not affect write speed but does speed up reads. When a chunkserver is added to an existing setup, data is automatically synchronized onto the new chunkserver to balance and even out usage.

9.3 mfs 1.5.x data recovery example (case shared by QQ group member Xufeng)
            It is actually simple: run the metadata restore tool (mfsrestore). When starting the service afterwards there is no message at all and the process will not come up; the cause is that the directory configured to hold the PID file has been deleted. Recreate that directory and give it the right ownership and the master starts fine. I have also done a 1.5 to 1.6 upgrade this way without problems.
            For details see Xufeng's blog: http://snipt.net/iamacnhero/tag/moosefs
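
A sketch of the recovery steps described above, assuming the 1.6.x tool name mfsmetarestore and the default install paths used earlier in this guide:

/usr/local/mfs/sbin/mfsmetarestore -a            # rebuild metadata.mfs from the last backup plus changelogs
mkdir -p /usr/local/mfs/var/mfs                  # recreate the data/PID directory if it was removed
chown -R mfs:mfs /usr/local/mfs/var/mfs
/usr/local/mfs/sbin/mfsmaster start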


10. Production deployments (contributions welcome; continuously updated)
http://www.gaokaochina.com  田逸




(Continued 3)


11. Web GUI monitoring

01_info.png 


02_servers.png 

03_disks.png 


04_exports.png 


05_mounts.png 


06_operations.png 


07_master_charts.png 


08_server_charts.png

(Continued 4)

12. Official MooseFS introduction to the 1.6.x release (Chinese translation by QQ group member Cuatre)


View on new features of next release v 1.6 of Moose File System

We are about to release a new version of MooseFS which would include a large number of new features and bug fixes. The new features are so significant that we decided to release it under 1.6 version. The newest beta files are in the GIT repository.

The key new features/changes of MooseFS 1.6 would include:

General:
* Removed duplicate source files.
* Strip whitespace at the end of configuration file lines.

Chunkserver:
* Rewritten in multi-threaded model.
* Added periodical chunk testing functionality (HDD_TEST_FREQ option).
* New -v option (prints version and exits).

Master:
* Added "noowner" objects flag (causes objects to belong to current user).
* Maintaining `mfsdirinfo` data online, so it doesn't need to be calculated on every request.
* Filesystem access authorization system (NFS-like mfsexports.cfg file, REJECT_OLD_CLIENTS option) with ro/rw, maproot, mapall and password functionality.
* New -v option (prints version and exits).

Mount:
* Rewritten options parsing in mount-like way, making it possible to use standard FUSE mount utilities (see the mfsmount(8) manual for new syntax). Note: old syntax is no longer accepted and a mountpoint is mandatory now (there is no default).
* Updated for FUSE 2.6+.
* Added password, file data cache, attribute cache and entry cache options. By default attribute cache and directory entry cache are enabled, file data cache and file entry cache are disabled.
* opendir() no longer reads directory contents - it's done on first readdir() now; fixes "rm -r" on recent Linux/glibc/coreutils combo.
* Fixed mtime setting just before close() (by flushing file on mtime change); fixes mtime preserving on "cp -p".
* Added statistics accessible through the MFSROOT/.stats pseudo-file.
* Changed master access method for mfstools (direct .master pseudo-file replaced by .masterinfo redirection); fixes possible mfstools race condition and allows using mfstools on read-only filesystems.

Tools:
* Units cleanup in values display (exact values, IEC-60027/binary prefixes, SI/decimal prefixes); new options: -n, -h, -H and the MFSHRFORMAT environment variable - refer to the mfstools(8) manual for details.
* mfsrgetgoal, mfsrsetgoal, mfsrgettrashtime, mfsrsettrashtime have been deprecated in favour of the new "-r" option for the mfsgetgoal, mfssetgoal, mfsgettrashtime, mfssettrashtime tools. (Note that the old and new command names look very similar but are different.)
* mfssnapshot utility replaced by mfsappendchunks (direct descendant of the old utility) and mfsmakesnapshot (which creates "real" recursive snapshots and behaves similar to "cp -r").
* New mfsfilerepair utility, which allows partial recovery of a file with some missing or broken chunks.

CGI scripts:
* First public version of CGI scripts allowing to monitor an MFS installation from a WWW browser.


13. Official MooseFS FAQ, in English (provided by QQ group member 靈犀)
What average write/read speeds can we expect?
The raw reading / writing speed obviously depends mainly on the performance of the used hard disk drives and the network capacity and its topology and varies from installation to installation. The better performance of hard drives used and better throughput of the net, the higher performance of the whole system.

In our in-house commodity servers (which additionally make lots of extra calculations) and a simple gigabit Ethernet network on a petabyte-class installation
on Linux (Debian) with goal=2 we have write speeds of about 20-30 MiB/s and reads of 30-50MiB/s. For smaller blocks the write speed decreases, but reading is not much affected. 


Similar FreeBSD based network has got a bit better writes and worse reads, giving overall a slightly better performance.

Does the goal setting influence writing/reading speeds?

Generally speaking,
it doesn’t. The goal setting can influence the reading speed only under certain conditions. For example, reading the same file at the same time by more than one client would be faster when the file has goal set to 2 and not goal=1.


But the situation in the real world when several computers read the same file at the same moment is very rare; therefore, the goal setting has rather little influence on the reading speeds.

Similarly, the writing speed is not much affected by the goal setting.


How well concurrent read operations are supported?

All read processes are parallel - there is no problem with concurrent reading of the same data by several clients at the same moment.

How much CPU/RAM resources are used?

In our environment (ca. 500 TiB, 25 million files, 2 million folders distributed on 26 million chunks on 70 machines) the usage of chunkserver CPU (by constant file transfer) is about 15-20% and chunkserver RAM usually consumes about 100MiB (independent of amount of data). 
The master server consumes about 30% of CPU (ca. 1500 operations per second) and 8GiB RAM. CPU load depends on amount of operations and RAM on number of files and folders.

Is it possible to add/remove chunkservers and disks on fly?

You can add / remove chunkservers on the fly. But mind that it is not wise to disconnect a chunkserver if there exists a chunk with only one copy (marked in orange in the CGI monitor). 
You can also disconnect (change) an individual hard drive. The scenario for this operation would be:


  • Mark the disk(s) for removal
  • Restart the chunkserver process
  • Wait for the replication (there should be no “undergoal” or “missing” chunks marked in yellow, orange or red in CGI monitor)
  • Stop the chunkserver process
  • Delete entry(ies) of the disconnected disk(s) in 'mfshdd.cfg'
  • Stop the chunkserver machine
  • Remove hard drive(s)
  • Start the machine
  • Start the chunkserver process

If you have hotswap disk(s) after step 5 you should follow these:
  • Unmount disk(s)
  • Remove hard drive(s)
  • Start the chunkserver process

If you follow the above steps, the work of client computers will not be interrupted and the whole operation will not be noticed by MooseFS users.
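
A hedged sketch of step 1 ("mark the disk(s) for removal"): in mfshdd.cfg this is done by prefixing the path with an asterisk, after which the chunkserver process is restarted so replication away from that disk can begin:

# in /usr/local/mfs/etc/mfshdd.cfg on the affected chunkserver:
#   */data        <- the leading '*' marks /data for removal
/usr/local/mfs/sbin/mfschunkserver stop && /usr/local/mfs/sbin/mfschunkserver start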

My experience with clustered filesystems is that metadata operations are quite slow. How did you resolve this problem?

We have noticed the problem with slow metadata operations and we decided to cache file system structure in RAM in the metadata server. This is why metadata server has increased memory requirements. 


When doing df -h on a filesystem the results are different from what I would expect taking into account actual sizes of written files.

Every chunkserver sends its own disk usage increased by 256MB for each used partition/hdd, and the master sends the sum of these to the client as the total disk usage. If you have 3 chunkservers with 7 hdd each, your disk usage will be increased by 3*7*256MB (about 5GB). Of course it's not important in real life, when you have for example 150TB of hdd space.

There is one other thing. If you use disks exclusively for MooseFS on chunkservers df will show correct disk usage, but if you have other data on your MooseFS disks df will count your own files too.

If you want to see usage of your MooseFS files use 'mfsdirinfo' command.


Do chunkservers and metadata server do their own checksumming?

Yes there is checksumming done by the system itself. We thought it would be CPU consuming but it is not really. Overhead is about 4B per a 64KiB block which is 4KiB per a 64MiB chunk (per goal).

What sort of sizing is required for the Master  server? 
The most important factor is RAM of mfsmaster machine, as the full file system structure is cached in RAM for speed. Besides RAM mfsmaster machine needs some space on HDD for main metadata file together with incremental logs. 

The size of the metadata file is dependent on the number of files (not on their sizes). The size of incremental logs depends on the number of operations per hour, but length (in hours) of this incremental log is configurable.

1 million files takes approximately 300 MiB of RAM. Installation of 25 million files requires about 8GiB of RAM and 25GiB space on HDD.


When I delete files or directories the MooseFS size doesn’t change. Why?

MooseFS is not erasing files immediately to let you revert the delete operation.

You can configure for how long files are kept in trash and empty the trash manually (to release the space). There are more details here:
http://moosefs.com/pages/userguides.html#2 in section "Operations specific for MooseFS".

In short - the time of storing a deleted file can be verified by the 
mfsgettrashtime command and changed with mfssettrashtime.


When I added a third server as an extra chunkserver it looked like it started replicating data to the 3rd server even though the file goal was still set to 2. 

Yes. The disk usage balancer uses chunks independently, so one file could be redistributed across all of your chunkservers.

Is MooseFS 64bit compatible?

Yes!

Can I modify the chunk size?

File data is divided into fragments (chunks) with a maximum of 64MiB each. The value of 64 MiB is hard coded into system so you cannot modify its size. We based the chunk size on real-world data and it was a very good compromise between number of chunks and speed of rebalancing / updating the filesystem. Of course if a file is smaller than 64 MiB it occupies less space. 

Please note systems we take care of enjoy files of size well exceeding 100GB and there is no chunk size penalty noticeable. 

How do I know if a file has been successfully written in MooseFS?

First off, let's briefly discuss the way the writing process is done in file systems and what programming consequences this bears. Basically, files are written through a buffer (write cache) in all contemporary file systems. As a result, execution of the "write" command itself only transfers the data to a buffer (cache), with no actual writing taking place. Hence, a confirmed execution of the "write" command does not mean that the data has been correctly written on a disc. It is only with the correct performance of the "fsync" (or "close") command that all data kept in buffers (cache) gets physically written. If an error occurs while such buffer-kept data is being written, it could return an incorrect status for the "fsync" (or even "close", not only "write") command.
The problem is that a vast majority of programmers do not test the "close" command status (which is generally a mistake, though a very common one). Consequently, a program writing data on a disc may "assume" that the data has been written correctly, while it has actually failed. 
As far as MooseFS is concerned – first, its write buffers are larger than in classic file systems (an issue of efficiency); second, write errors may be more frequent than in case of a classic hard drive (the network nature of MooseFS provokes some additional error-inducing situations). As a consequence, the amount of data processed during execution of the "close" command is often significant and if an error occurs while the data is being written, this will be returned in no other way than as an error in execution of the "close" command only. 
Hence, before executing "close", it is recommended (especially when using MooseFS) to perform "fsync" after writing in a file and then check the status of "fsync" and – just in case – the status of "close" as well. 
NOTE! When "stdio" is used, the "fflush" function only executes the "write" command, so correct execution of "fflush" is not enough grounds to be sure that all data has been written successfully – you should also check the status of "fclose".
One frequent situation in which the above problem may occur is redirecting a standard output of a program to a file in "shell". Bash (and many other programs) does not check the status of "close" execution and so the syntax of the "application > outcome.txt" type may wrap up successfully in "shell", while in fact there has been an error in writing the "outcome.txt" file. You are strongly advised to avoid using the above syntax. If necessary, you can create a simple program reading the standard input and writing everything to a chosen file (but with an appropriate check with the "fsync" command) and then use "application | mysaver outcome.txt", where "mysaver" is the name of your writing program instead of "application > outcome.txt".
Please note that the problem discussed above is in no way exceptional and does not stem directly from the characteristics of MooseFS itself. It may affect any system of files – only that network type systems are more prone to such difficulties. Technically speaking, the above recommendations should be followed at all times (also in case of classic file systems). 
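
A small shell-level illustration of the advice above (my example, not from the FAQ): instead of relying on a bare redirect, pipe through a writer that fsyncs and whose exit status is checked, e.g. dd with conv=fsync:

application | dd of=/mnt/mfs/outcome.txt bs=1M conv=fsync || echo "write to MooseFS failed" >&2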
