MooseFS (MFS) distributed file system

The definitive MFS (MooseFS) guide: a one-stop distributed file system solution (deployment, performance testing), continuously updated

http://bbs.chinaunix.net/thread-1644309-1-1.html


1. I ran into some problems during the performance tests. My time is limited, so I hope everyone will test and solve them together; please point out any issues promptly, because I am still feeling my way forward as well.
2. I will not introduce MFS itself here; for details see the MFS hands-on article in this forum,
http://bbs.chinaunix.net/thread-1643863-1-1.html , or search Baidu/Google for the keyword 田逸.
3. I hope readers can contribute better storage/file-system test models so we can improve this document together (test scripts and test cases are warmly welcomed).
4. Please share real production cases: configurations, environments, scripts, monitoring mechanisms, and so on.
5. I hope readers familiar with the code will look into how MFS is implemented internally.
6. Special thanks to 田逸 for his document:
http://sery.blog.51cto.com/10037/263515 .
7. Special thanks to QQ group members tt, 靈犀, 流雲風 and hzqbbc for sharing their valuable experience with everyone in the group.

8. Special thanks to the storage expert 冬瓜頭, author of 《大話存儲》, for his guidance while I was running the performance tests.
9. Special thanks to QQ group member 高性能架構 (CU ID: leo_ss_pku) for producing a more professional and polished PDF version: MooseFS權威指南.pdf (3.32 MB, 3388 downloads); you can also read the online version on his blog: http://www.himysql.com/doc/mfs.html
     
MFS advantages:
-1. Free (GPL).
0. A general-purpose file system: upper-layer applications can use it without modification (DFSs that require a dedicated API are such a pain!).
1. Online capacity expansion; the architecture scales extremely well (official cases have grown to 70 nodes!).
2. Simple to deploy (sysadmins are delighted, managers are happy!).
3. Highly available architecture, with no single point of failure among the components (what are you waiting for?).
4. Highly available file objects: any degree of file redundancy can be configured (redundancy levels beyond RAID 1+0) without hurting read or write performance; it only gets faster!
5. A Windows-style recycle bin (no more fear of fat-fingered deletes; it gives you instant-rollback features similar to Oracle Flashback and other high-end DBMS capabilities, which Oracle charges for!).
6. Garbage collection similar to the Java GC.
7. Snapshot features like those of commercial storage from NetApp, EMC, IBM and others.
8. A C implementation of the Google File System idea (Google has cleared the path!).
9. A web GUI monitoring interface.
10. Better random read/write efficiency (still to be proven).
11. Better read/write efficiency for huge numbers of small files (still to be proven).
Possible bottlenecks:
0. The performance of the master itself (a rough analogy: as in MySQL master-slave replication, the slaves are easy to scale out, the master is not). (QQ group member: hzqbbc)
         Short-term workaround: partition by business line.
1. A foreseeable upper limit on the total number of files the architecture can hold.
       (MFS caches the whole file system structure in the master's RAM; in my view, the more files there are, the more master memory is consumed: roughly 8 GB for 25 million files, so 200 million files would need about 64 GB of RAM; see the quick arithmetic check after this list.) (QQ group member: hzqbbc)
         Short-term workaround: partition by business line.
2. The robustness of the single-point-of-failure (master failover) solution. (QQ group members: tt, hzqbbc)
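
A back-of-the-envelope check of the RAM estimate in item 1, assuming the reported ratio of about 8 GB of master RAM per 25 million files scales linearly:

# ~8 GiB / 25,000,000 files ≈ 340 bytes of master RAM per file
echo $(( 200000000 * 8 / 25000000 ))   # => 64, i.e. roughly 64 GiB for 200 million files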


Architecture diagrams
read862.png 

write862.png 

——————————————————
index
1. mfs master
2. mfs chunkserver
3. mfs client

4. System administration
5. Performance testing

6. References
6.1 Literature
6.2 Test data
                Test model 1
                Test model 2
7. Thanks
8. Appendix
9. Real-world operational cases
10. Production deployments
11. Web GUI monitoring
12. Official introduction to the 1.6.x release (Chinese translation: QQ group member Cuatre)
13. Official MooseFS FAQ in English (provided by QQ group member 靈犀)

14. mfs master hot-standby solution
15. mfs Nagios monitoring script (provided by QQ group member 流雲風)
————————————————
Environment
master          1 server
chunkserver     3 servers
client          1 server
OS:
CentOS 5.3 x64
1. mfs master
1.1 Install the mfs master
wget http://ncu.dl.sourceforge.net/project/moosefs/moosefs/1.6.11/mfs-1.6.11.tar.gz
tar zxvf mfs-1.6.11.tar.gz 
cd mfs-1.6.11
useradd mfs -s /sbin/nologin
./configure --prefix=/usr/local/mfs --with-default-user=mfs --with-default-group=mfs
make
make install
cd /usr/local/mfs/etc/
cp mfsmaster.cfg.dist mfsmaster.cfg
cp mfsexports.cfg.dist mfsexports.cfg
vim mfsmaster.cfg
vim mfsexports.cfg
cd ../var/mfs/
cp metadata.mfs.empty metadata.mfs
cat metadata.mfs
/usr/local/mfs/sbin/mfsmaster start 
ps axu | grep mfsmaster
lsof -i
tail -f /var/log/messages 

1.2 Start the master service
/usr/local/mfs/sbin/mfsmaster start 
working directory: /usr/local/mfs/var/mfs
lockfile created and locked
initializing mfsmaster modules ...
loading sessions ... ok
sessions file has been loaded
exports file has been loaded
loading metadata ...
create new empty filesystem
metadata file has been loaded
no charts data file - initializing empty charts
master <-> metaloggers module: listen on *:9419
master <-> chunkservers module: listen on *:9420
main master server module: listen on *:9421
mfsmaster daemon initialized properly

1.3 Stop the master service
/usr/local/mfs/sbin/mfsmaster -s

1.4 Start and stop the web GUI
Start: /usr/local/mfs/sbin/mfscgiserv
Stop: kill the mfscgiserv process, e.g. kill $(pgrep -f mfscgiserv)

1.5 Related configuration files
vim  mfsexports.cfg
192.168.28.0/24  . rw
192.168.28.0/24  /       rw
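
Each line in mfsexports.cfg has the form "client-address  directory  options". As far as I understand the 1.6.x format, "." exports the meta filesystem (needed for the mfsmount -m mount in section 3.3) and "/" exports the regular file system root; a commented sketch of the two lines above:

# <clients>           <directory>   <options>
# 192.168.28.0/24     .             rw        # meta filesystem (trash/reserved access)
# 192.168.28.0/24     /             rw        # whole MooseFS tree, read-write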

2. mfs chunkserver
2.1 Create a local file system on a block device
fdisk -l
mkfs.ext3 /dev/sdb 
mkdir /data

chown mfs:mfs /data
mount -t ext3 /dev/sdb /data

df -ah
/dev/sdb              133G  188M  126G   1% /data

2.2 Create a 50 GB loop-device file
df -ah
dd if=/dev/zero of=/opt/mfs.img bs=1M count=50000
losetup /dev/loop0 /opt/mfs.img
mkfs.ext3 /dev/loop0
mkdir /data
chown mfs:mfs /data
mount /dev/loop0 /data
df -ah



2.3 Install the chunkserver
wget http://ncu.dl.sourceforge.net/project/moosefs/moosefs/1.6.11/mfs-1.6.11.tar.gz
tar zxvf mfs-1.6.11.tar.gz 
cd mfs-1.6.11
useradd mfs -s /sbin/nologin
./configure --prefix=/usr/local/mfs --with-default-user=mfs --with-default-group=mfs
make
make install
cd /usr/local/mfs/etc/
cp mfschunkserver.cfg.dist mfschunkserver.cfg
cp mfshdd.cfg.dist mfshdd.cfg
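
mfshdd.cfg must list the directories the chunkserver may use for chunk storage, one path per line; for the /data mount prepared in 2.1/2.2 that would be (sketch):

echo "/data" >> /usr/local/mfs/etc/mfshdd.cfg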


2.4 Start the chunkserver
/usr/local/mfs/sbin/mfschunkserver start
ps axu |grep mfs
tail -f /var/log/messages 

2.5 Stop the chunkserver
/usr/local/mfs/sbin/mfschunkserver stop


3. mfs client
3.1 Install FUSE
yum install kernel.x86_64 kernel-devel.x86_64 kernel-headers.x86_64
###reboot server####
yum install fuse.x86_64 fuse-devel.x86_64 fuse-libs.x86_64
modprobe fuse
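
An optional sanity check that the fuse module is actually loaded before building mfsmount:

lsmod | grep fuse
ls -l /dev/fuse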


3.2 Install the mfs client
wget http://ncu.dl.sourceforge.net/project/moosefs/moosefs/1.6.11/mfs-1.6.11.tar.gz
tar zxvf mfs-1.6.11.tar.gz 
cd mfs-1.6.11
useradd mfs -s /sbin/nologin
./configure --prefix=/usr/local/mfs --with-default-user=mfs --with-default-group=mfs --enable-mfsmount
make 
make install

3.3 Mount the file system
cd /mnt/
mkdir mfs
/usr/local/mfs/bin/mfsmount /mnt/mfs/ -H 192.168.28.242

mkdir mfsmeta
/usr/local/mfs/bin/mfsmount -m /mnt/mfsmeta/ -H 192.168.28.242

df -ah
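
To unmount, a plain umount /mnt/mfs is enough; if the mount should come back after a reboot, one simple (if crude) option is an rc.local entry, a sketch assuming the paths and master address used above:

echo "/usr/local/mfs/bin/mfsmount /mnt/mfs -H 192.168.28.242" >> /etc/rc.local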



4. System administration

4.1 Management commands

Set the number of copies (goal) for files; 3 copies are recommended:
/usr/local/mfs/bin/mfssetgoal -r 3 /mnt/mfs

Query the goal of a file or directory:
/usr/local/mfs/bin/mfsgetgoal  /mnt/mfs

Show directory information (space usage and object counts):
/usr/local/mfs/bin/mfsdirinfo -H /mnt/mfs
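
Trash retention can be inspected and tuned with the same family of tools (they come up again in case 9.1); a sketch:

/usr/local/mfs/bin/mfsgettrashtime /mnt/mfs           # show the current trash time (in seconds)
/usr/local/mfs/bin/mfssettrashtime -r 300 /mnt/mfs    # recursively set it to 300 seconds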



5. Performance testing

5.1 MFS

1. Large file (block = 1 MB)
dd if=/dev/zero of=1.img bs=1M count=5000
5242880000 bytes (5.2 GB) copied, 48.8481 seconds, 107 MB/s

2. Small files (50 bytes x 1,000,000 files x 1 client) (1000 * 1000 directories), write
real    83m41.343s
user    4m17.993s
sys    16m58.939s

List
time find ./ -type f | nl | tail
999999  ./0/1
1000000 ./0/0
real    0m39.418s
user    0m0.721s
sys    0m0.225s

Delete
time rm -fr *
real    6m35.273s
user    0m0.394s
sys    0m23.546s

3. Small files (1 KB x 1,000,000 files x 100 clients) (1000 * 1000 directories)
Write (100 clients)
time ../../p_touch_file.sh
real    22m51.159s
user    4m42.850s
sys    18m41.437s

List (1 client)
time find ./ | nl | tail 
real    0m35.910s
user    0m0.628s
sys    0m0.204s

Delete (1 client)
time rm -fr *
real    6m36.530s
user    0m0.697s
sys    0m21.682s


4. Small files (1 KB x 1,000,000 files x 200 clients) (1000 * 1000 directories)
time ../../p_touch_file.sh
real    27m56.656s
user    5m12.195s
sys    20m52.079s



5. Small files (1 KB x 1,000,000 files x 1000 clients) (1000 * 1000 directories)
Write
time ../../p_touch_file.sh
real    30m30.336s
user    5m6.607s
sys    21m





5.2 Local disk
1. Large file (block = 1 MB)
dd if=/dev/zero of=1.img bs=1M count=5000
5242880000 bytes (5.2 GB) copied, 58.7371 seconds, 89.3 MB/s


2. Small files (50 bytes x 1,000,000 files x 1 client) (1000 * 1000 directories)
Write
time ../touch_file.sh 
real  17m47.746s
user 4m54.068s
sys  12m54.425s

List
time find ./ -type f | nl | tail
1000000 ./875/582
1000001 ./875/875
real 0m9.120s
user 0m1.102s
sys 0m0.726s

Delete
time rm -fr *
real 0m37.201s
user 0m0.432s
sys 0m15.268s


5.3 Benchmarks (first run)
5.3.1 Random read

random_read_performance.jpg 

5.3.2 Random write
random_wirte_performance.jpg 

5.3.3 Sequential read
read_performance.jpg 
5.3.4 Sequential write
write_performance.jpg 


5.4 Benchmarks (second run)
5.4.1 Random read
2_random_write_performance.jpg 

(Continued 1)

6. References
6.1 Literature
http://sery.blog.51cto.com/10037/263515  田逸
http://bbs.chinaunix.net/thread-1643863-1-1.html  ltgzs777
http://www.moosefs.org/  official site
http://bbs.chinaunix.net/thread-1643015-1-2.html   testing tools


6.2 Test data

Performance test model 1
Test results from a fellow whose name I do not know; I am posting them first - if you see this, please message me.


Small-file performance test

Two-level 100*100 directory tree:
* Single 15k.5 disk, ext3, single client process
  Create: real 0m0.762s   user 0m0.048s   sys 0m0.261s
  List:   real 0m0.179s   user 0m0.036s   sys 0m0.125s
  Delete: real 0m0.492s   user 0m0.036s   sys 0m0.456s
* Single 15k.5 disk, ext3, 10 concurrent client processes (longest time)
  Create: real 0m0.724s   user 0m0.015s   sys 0m0.123s
  List:   real 0m0.057s   user 0m0.006s   sys 0m0.025s
  Delete: real 0m0.226s   user 0m0.010s   sys 0m0.070s
* 6 chunkservers, cache, single client process
  Create: real 0m2.084s   user 0m0.036s   sys 0m0.252s
  List:   real 0m4.964s   user 0m0.043s   sys 0m0.615s
  Delete: real 0m6.661s   user 0m0.046s   sys 0m0.868s
* 6 chunkservers, cache, 10 concurrent client processes (longest time)
  Create: real 0m1.422s   user 0m0.007s   sys 0m0.050s
  List:   real 0m2.022s   user 0m0.008s   sys 0m0.108s
  Delete: real 0m2.318s   user 0m0.008s   sys 0m0.136s

Two-level 1000*1000 directory tree:
* Single 15k.5 disk, ext3, single client process
  Create: real 11m37.531s  user 0m4.363s   sys 0m37.362s
  List:   real 39m56.940s  user 0m9.277s   sys 0m48.261s
  Delete: real 41m57.803s  user 0m10.453s  sys 3m11.808s
* Single 15k.5 disk, ext3, 10 concurrent client processes (longest time)
  Create: real 11m7.703s   user 0m0.519s   sys 0m10.616s
  List:   real 39m30.678s  user 0m1.031s   sys 0m4.962s
  Delete: real 40m23.018s  user 0m1.043s   sys 0m19.618s
* 6 chunkservers, cache, single client process
  Create: real 3m17.913s   user 0m3.268s   sys 0m30.192s
  List:   real 11m56.645s  user 0m3.810s   sys 1m10.387s
  Delete: real 12m14.900s  user 0m3.799s   sys 1m26.632s
* 6 chunkservers, cache, 10 concurrent client processes (longest time)
  Create: real 1m13.666s   user 0m0.328s   sys 0m3.295s
  List:   real 4m31.761s   user 0m0.531s   sys 0m10.235s
  Delete: real 4m26.962s   user 0m0.663s   sys 0m13.117s

Three-level 100*100*100 directory tree:
* Single 15k.5 disk, ext3, single client process
  Create: real 9m51.331s   user 0m4.036s   sys 0m32.597s
  List:   real 27m24.615s  user 0m8.907s   sys 0m44.240s
  Delete: real 28m17.194s  user 0m10.644s  sys 1m34.998s
* Single 15k.5 disk, ext3, 10 client processes (longest time)
  Create: real 10m22.170s  user 0m0.580s   sys 0m11.720s
  List:   real 33m32.386s  user 0m1.127s   sys 0m5.280s
  Delete: real 33m7.808s   user 0m1.196s   sys 0m10.588s
* 6 chunkservers, cache, single client process
  Create: real 3m21.720s   user 0m3.089s   sys 0m26.635s
  List:   real 9m26.535s   user 0m3.901s   sys 1m11.756s
  Delete: real 10m51.558s  user 0m4.186s   sys 1m26.322s
* 6 chunkservers, cache, 10 concurrent client processes (longest time)
  Create: real 1m23.023s   user 0m0.429s   sys 0m3.869s
  List:   real 4m10.617s   user 0m0.643s   sys 0m11.588s
  Delete: real 4m20.137s   user 0m0.649s   sys 0m14.120s
* 6 chunkservers, cache, 50 concurrent client processes (longest time)
  Create: real 1m26.388s   user 0m0.074s   sys 0m0.679s
  List:   real 4m37.102s   user 0m0.132s   sys 0m2.160s
  Delete: real 4m37.392s   user 0m0.132s   sys 0m2.755s
* 6 chunkservers, cache, 100 concurrent client processes (longest time)
  Create: real 1m29.338s   user 0m0.062s   sys 0m0.363s
  List:   real 4m54.925s   user 0m0.069s   sys 0m1.212s
  Delete: real 4m35.845s   user 0m0.068s   sys 0m1.640s
* 6 chunkservers, cache, remote client, 10 concurrent processes (longest time)
  Create: real 4m0.411s    user 0m2.985s   sys 0m12.287s
  List:   real 8m31.351s   user 0m4.223s   sys 0m29.800s
  Delete: real 4m3.271s    user 0m3.206s   sys 0m11.922s

Three-level 100*100*100 directory tree, five consecutive runs:
* Changelog/metadata size: run 1 ~55 MB; runs 2-5 ~60 MB each
* Consecutive create times:
  Run 1: real 4m0.411s    user 0m2.985s   sys 0m12.287s
  Run 2: real 4m12.309s   user 0m3.039s   sys 0m12.899s
  Run 3: real 4m14.010s   user 0m3.418s   sys 0m12.831s
  Run 4: real 4m14.214s   user 0m3.247s   sys 0m12.871s
  Run 5: real 4m14.417s   user 0m3.170s   sys 0m12.948s

Notes:
* On a single disk, adding processes does not improve performance: the workload is stuck in I/O wait, and more processes can even burn a lot of time in scheduling.
* With MFS, performance does improve with multiple processes, and the main cost shifts to CPU system time; in practice, therefore, massive small-file performance is far better than on the local file system.



Performance test model 2 (thanks to QQ group member 痞子白)
Two clients running dd at the same time
Block size 1 MB, file size 20 GB
Client1  write: 68.4 MB/s   read: 25.3 MB/s
Client2  write: 67.5 MB/s   read: 24.7 MB/s
Aggregate throughput: write 135.9 MB/s, read 50.0 MB/s

Write command: dd if=/dev/zero of=/mfs/test.1 bs=1M count=20000
Read command:  dd if=/mfs/test.1 of=/dev/null bs=1M
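
One caveat worth adding (my note, not part of the original test): the read pass is served partly from the client page cache unless the cache is dropped between the write and the read, e.g.:

sync && echo 3 > /proc/sys/vm/drop_caches    # run on the client before the read pass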


7. Thanks
田逸
The fellow whose name I do not know (please contact me if you see this)



8. Appendix
8.1 Script for 1000 * 1000 * 1 client
#!/bin/bash
for ((i=0;i<1000;i++))
do
    mkdir ${i}
    cd ${i}
    for ((j=0;j<1000;j++))
      do
        cp /mnt/test ${j}
      done
      cd ..
done
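
After a run, the file count can be checked against the expected 1,000,000 with the same find pipeline used in section 5 (sketch):

find . -type f | wc -l    # expect 1000000
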
8.2 Script for 1000 * 1000 * (100, 200, 1000 clients)
#!/bin/bash
declare -f make_1000_dir_file
cd `pwd`
function make_1000_dir_file {
    start=${1}
    stop=${2}
    for ((i=${start};i<${stop};i++))
    do
        mkdir ${i}
        for ((j=0;j<1000;j++))
        do
            cp /mnt/test ${i}/${j}
        done
    done
}
i=1
while [ ${i} -le 1000 ]
do 
    ((n=${i}+1))
    make_1000_dir_file ${i} ${n} &
    ((i=${i}+1))
done 
wait
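
A usage note (my reading of the script, not spelled out in the original): the while-loop bound of 1000 is the number of concurrent writer processes, so change it to 100 or 200 for the smaller runs, and start the script from a working directory on the MFS mount, e.g.:

cd /mnt/mfs/test               # directory name assumed; any directory on the MFS mount works
time ../../p_touch_file.sh     # invoked as in section 5; adjust the relative path to where the script lives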


(Continued 2)

9. Real-world operational cases
9.1 The default trash (garbage-collection) time is 86400 seconds, so it is possible for storage capacity to blow up before the trash has been emptied. (Case provided by shinelian.)

Option 1: lower the trash time and monitor storage capacity aggressively.
           In testing, setting the trash time to 300 seconds reclaimed the space correctly.

Option 2: periodically delete the files under the trash directory of the mfsmeta mount by hand (robustness still to be verified; the space is reclaimed after deletion, but I do not know whether there are side effects). See the sketch after this item.
           In testing there appear to be no side effects; if you run into any, please contact me in the QQ group.
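
A minimal sketch of the two options above, assuming the meta filesystem is mounted at /mnt/mfsmeta as in section 3.3 (trash directory name as in the 1.6.x meta mount layout):

/usr/local/mfs/bin/mfssettrashtime -r 300 /mnt/mfs    # option 1: shorten trash retention to 300 seconds
rm -rf /mnt/mfsmeta/trash/*                           # option 2: manual purge; deleted files can no longer be undeleted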


9.2 I read the MFS 1.6.x User Guides and FAQ, discussed the parts I did not understand with 靈犀, and we reconciled the points where our readings differed. Compared with 1.5.x, MFS 1.6.x changes in the following ways (special thanks to QQ group members 流雲風 and 靈犀):
     (1) Fixes the 1.5.x bug of too many open files during large batch operations. We hit this during our tests as well: the error reported too many open files and caused connection failures on the chunkservers. We later tried repeatedly to reproduce the problem but never could. Solving it in 1.6.x removes a major issue.

      (2) Adds the metalogger server. This did not exist in 1.5.x; it provides redundancy for the master and further strengthens its stability. In the MFS architecture the master has the highest stability and performance requirements, so keeping the master stable is essential.

      (3) Improves the handling of bad chunks. In 1.5.x, chunk checksum errors on a chunkserver would quite often lead the master to automatically kick the affected chunkserver out. 1.6.x adds a repair function for bad chunks, making repair convenient and simplifying how bad chunks are dealt with.

      (4) A new understanding of metadata and the changelog. We used to think the changelog recorded file operations and was periodically rolled into the metadata, like a database log being archived. That understanding was off: the changelog does record operations on files, while the metadata records file sizes and locations. The metadata is therefore the more important piece, and recovery is performed from the metadata plus the most recent changelog.

      (5) The MFS documentation now states the memory and disk requirements explicitly: "In our environment (ca. 500 TiB, 25 million files, 2 million folders distributed on 26 million chunks on 70 machines) the usage of chunkserver CPU (by constant file transfer) is about 15-20% and chunkserver RAM usually consumes about 100MiB (independent of amount of data). The master server consumes about 30% of CPU (ca. 1500 operations per second) and 8GiB RAM. CPU load depends on amount of operations and RAM on number of files and folders."

      (6) It points out that in testing, having more chunkservers does not affect write speed but does speed up reads. When a chunkserver is added to an existing setup, data is automatically synchronized onto the new chunkserver to balance and even out usage.

9.3 mfs 1.5.x data recovery example (case shared by QQ group member Xufeng)
            It is actually simple: run the metadata restore tool (mfsrestore). When starting the service afterwards there is no message at all and the process will not come up; the cause is that the directory configured to hold the PID file has been deleted. Recreate that directory and give it the right ownership and the master starts fine. I have also done a 1.5 to 1.6 upgrade this way without problems.
            For details see Xufeng's blog: http://snipt.net/iamacnhero/tag/moosefs
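
A sketch of the recovery steps described above, assuming the 1.6.x tool name mfsmetarestore and the default install paths used earlier in this guide:

/usr/local/mfs/sbin/mfsmetarestore -a            # rebuild metadata.mfs from the last backup plus changelogs
mkdir -p /usr/local/mfs/var/mfs                  # recreate the data/PID directory if it was removed
chown -R mfs:mfs /usr/local/mfs/var/mfs
/usr/local/mfs/sbin/mfsmaster start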


10. Production deployments (contributions welcome; continuously updated)
http://www.gaokaochina.com  田逸




(Continued 3)


11. Web GUI monitoring

01_info.png 


02_servers.png 

03_disks.png 


04_exports.png 


05_mounts.png 


06_operations.png 


07_master_charts.png 


08_server_charts.png

(Continued 4)

12. Official MooseFS introduction to the 1.6.x release (Chinese translation by QQ group member Cuatre)


View on new features of next release v 1.6 of Moose File System

We are about to release a new version of MooseFS which would include a large number of new features and bug fixes. The new features are so significant that we decided to release it under 1.6 version. The newest beta files are in the GIT repository.

The key new features/changes of MooseFS 1.6 would include:

General:
* Removed duplicate source files.
* Strip whitespace at the end of configuration file lines.

Chunkserver:
* Rewritten in multi-threaded model.
* Added periodical chunk testing functionality (HDD_TEST_FREQ option).
* New -v option (prints version and exits).

Master:
* Added "noowner" objects flag (causes objects to belong to current user).
* Maintaining `mfsdirinfo` data online, so it doesn't need to be calculated on every request.
* Filesystem access authorization system (NFS-like mfsexports.cfg file, REJECT_OLD_CLIENTS option) with ro/rw, maproot, mapall and password functionality.
* New -v option (prints version and exits).

Mount:
* Rewritten options parsing in mount-like way, making it possible to use standard FUSE mount utilities (see the mfsmount(8) manual for new syntax). Note: old syntax is no longer accepted and a mountpoint is mandatory now (there is no default).
* Updated for FUSE 2.6+.
* Added password, file data cache, attribute cache and entry cache options. By default attribute cache and directory entry cache are enabled, file data cache and file entry cache are disabled.
* opendir() no longer reads directory contents - it's done on first readdir() now; fixes "rm -r" on recent Linux/glibc/coreutils combo.
* Fixed mtime setting just before close() (by flushing file on mtime change); fixes mtime preserving on "cp -p".
* Added statistics accessible through the MFSROOT/.stats pseudo-file.
* Changed master access method for mfstools (direct .master pseudo-file replaced by .masterinfo redirection); fixes possible mfstools race condition and allows using mfstools on read-only filesystems.

Tools:
* Units cleanup in values display (exact values, IEC-60027/binary prefixes, SI/decimal prefixes); new options: -n, -h, -H and the MFSHRFORMAT environment variable - refer to the mfstools(8) manual for details.
* mfsrgetgoal, mfsrsetgoal, mfsrgettrashtime, mfsrsettrashtime have been deprecated in favour of the new "-r" option for the mfsgetgoal, mfssetgoal, mfsgettrashtime, mfssettrashtime tools. (Note that the old and new command names look very similar but are different.)
* mfssnapshot utility replaced by mfsappendchunks (direct descendant of the old utility) and mfsmakesnapshot (which creates "real" recursive snapshots and behaves similar to "cp -r").
* New mfsfilerepair utility, which allows partial recovery of a file with some missing or broken chunks.

CGI scripts:
* First public version of CGI scripts allowing to monitor an MFS installation from a WWW browser.


13. Official MooseFS FAQ, in English (provided by QQ group member 靈犀)
What average write/read speeds can we expect?
The raw reading / writing speed obviously depends mainly on the performance of the used hard disk drives and the network capacity and its topology and varies from installation to installation. The better performance of hard drives used and better throughput of the net, the higher performance of the whole system.

In our in-house commodity servers (which additionally make lots of extra calculations) and a simple gigabit Ethernet network on a petabyte-class installation
on Linux (Debian) with goal=2 we have write speeds of about 20-30 MiB/s and reads of 30-50MiB/s. For smaller blocks the write speed decreases, but reading is not much affected. 


Similar FreeBSD based network has got a bit better writes and worse reads, giving overall a slightly better performance.

Does the goal setting influence writing/reading speeds?

Generally speaking,
it doesn’t. The goal setting can influence the reading speed only under certain conditions. For example, reading the same file at the same time by more than one client would be faster when the file has goal set to 2 and not goal=1.


But the situation in the real world when several computers read the same file at the same moment is very rare; therefore, the goal setting has rather little influence on the reading speeds.

Similarly, the writing speed is not much affected by the goal setting.


How well concurrent read operations are supported?

All read processes are parallel - there is no problem with concurrent reading of the same data by several clients at the same moment.

How much CPU/RAM resources are used?

In our environment (ca. 500 TiB, 25 million files, 2 million folders distributed on 26 million chunks on 70 machines) the usage of chunkserver CPU (by constant file transfer) is about 15-20% and chunkserver RAM usually consumes about 100MiB (independent of amount of data). 
The master server consumes about 30% of CPU (ca. 1500 operations per second) and 8GiB RAM. CPU load depends on amount of operations and RAM on number of files and folders.

Is it possible to add/remove chunkservers and disks on fly?

You can add / remove chunkservers on the fly. But mind that it is not wise to disconnect a chunkserver if there exists a chunk with only one copy (marked in orange in the CGI monitor). 
You can also disconnect (change) an individual hard drive. The scenario for this operation would be:


  • Mark the disk(s) for removal
  • Restart the chunkserver process
  • Wait for the replication (there should be no “undergoal” or “missing” chunks marked in yellow, orange or red in CGI monitor)
  • Stop the chunkserver process
  • Delete entry(ies) of the disconnected disk(s) in 'mfshdd.cfg'
  • Stop the chunkserver machine
  • Remove hard drive(s)
  • Start the machine
  • Start the chunkserver process

If you have hotswap disk(s) after step 5 you should follow these:
  • Unmount disk(s)
  • Remove hard drive(s)
  • Start the chunkserver process

If you follow the above steps, the work of client computers will not be interrupted and the whole operation will not be noticed by MooseFS users.
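
A hedged sketch of step 1 ("mark the disk(s) for removal"): in mfshdd.cfg this is done by prefixing the path with an asterisk, after which the chunkserver process is restarted so replication away from that disk can begin:

# in /usr/local/mfs/etc/mfshdd.cfg on the affected chunkserver:
#   */data        <- the leading '*' marks /data for removal
/usr/local/mfs/sbin/mfschunkserver stop && /usr/local/mfs/sbin/mfschunkserver start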

My experience with clustered filesystems is that metadata operations are quite slow. How did you resolve this problem?

We have noticed the problem with slow metadata operations and we decided to cache file system structure in RAM in the metadata server. This is why metadata server has increased memory requirements. 


When doing df -h on a filesystem the results are different from what I would expect taking into account actual sizes of written files.

Every chunkserver sends its own disk usage increased by 256MB for each used partition/hdd, and the master sends the sum of these to the client as the total disk usage. If you have 3 chunkservers with 7 hdd each, your disk usage will be increased by 3*7*256MB (about 5GB). Of course it's not important in real life, when you have for example 150TB of hdd space.

There is one other thing. If you use disks exclusively for MooseFS on chunkservers df will show correct disk usage, but if you have other data on your MooseFS disks df will count your own files too.

If you want to see usage of your MooseFS files use 'mfsdirinfo' command.


Do chunkservers and metadata server do their own checksumming?

Yes there is checksumming done by the system itself. We thought it would be CPU consuming but it is not really. Overhead is about 4B per a 64KiB block which is 4KiB per a 64MiB chunk (per goal).

What sort of sizing is required for the Master  server? 
The most important factor is RAM of mfsmaster machine, as the full file system structure is cached in RAM for speed. Besides RAM mfsmaster machine needs some space on HDD for main metadata file together with incremental logs. 

The size of the metadata file is dependent on the number of files (not on their sizes). The size of incremental logs depends on the number of operations per hour, but length (in hours) of this incremental log is configurable.

1 million files takes approximately 300 MiB of RAM. Installation of 25 million files requires about 8GiB of RAM and 25GiB space on HDD.


When I delete files or directories the MooseFS size doesn’t change. Why?

MooseFS is not erasing files immediately to let you revert the delete operation.

You can configure for how long files are kept in trash and empty the trash manually (to release the space). There are more details here:
http://moosefs.com/pages/userguides.html#2 in section "Operations specific for MooseFS".

In short - the time of storing a deleted file can be verified by the 
mfsgettrashtime command and changed with mfssettrashtime.


When I added a third server as an extra chunkserver it looked like it started replicating data to the 3rd server even though the file goal was still set to 2. 

Yes. The disk usage balancer uses chunks independently, so one file could be redistributed across all of your chunkservers.

Is MooseFS 64bit compatible?

Yes!

Can I modify the chunk size?

File data is divided into fragments (chunks) with a maximum of 64MiB each. The value of 64 MiB is hard coded into system so you cannot modify its size. We based the chunk size on real-world data and it was a very good compromise between number of chunks and speed of rebalancing / updating the filesystem. Of course if a file is smaller than 64 MiB it occupies less space. 

Please note systems we take care of enjoy files of size well exceeding 100GB and there is no chunk size penalty noticeable. 

How do I know if a file has been successfully written in MooseFS?

First off, let's briefly discuss the way the writing process is done in file systems and what programming consequences this bears. Basically, files are written through a buffer (write cache) in all contemporary file systems. As a result, execution of the "write" command itself only transfers the data to a buffer (cache), with no actual writing taking place. Hence, a confirmed execution of the "write" command does not mean that the data has been correctly written on a disc. It is only with the correct performance of the "fsync" (or "close") command that all data kept in buffers (cache) gets physically written. If an error occurs while such buffer-kept data is being written, it could return an incorrect status for the "fsync" (or even "close", not only "write") command.
The problem is that a vast majority of programmers do not test the "close" command status (which is generally a mistake, though a very common one). Consequently, a program writing data on a disc may "assume" that the data has been written correctly, while it has actually failed. 
As far as MooseFS is concerned – first, its write buffers are larger than in classic file systems (an issue of efficiency); second, write errors may be more frequent than in case of a classic hard drive (the network nature of MooseFS provokes some additional error-inducing situations). As a consequence, the amount of data processed during execution of the "close" command is often significant and if an error occurs while the data is being written, this will be returned in no other way than as an error in execution of the "close" command only. 
Hence, before executing "close", it is recommended (especially when using MooseFS) to perform "fsync" after writing in a file and then check the status of "fsync" and – just in case – the status of "close" as well. 
NOTE! When "stdio" is used, the "fflush" function only executes the "write" command, so correct execution of "fflush" is not enough grounds to be sure that all data has been written successfully – you should also check the status of "fclose".
One frequent situation in which the above problem may occur is redirecting a standard output of a program to a file in "shell". Bash (and many other programs) does not check the status of "close" execution and so the syntax of the "application > outcome.txt" type may wrap up successfully in "shell", while in fact there has been an error in writing the "outcome.txt" file. You are strongly advised to avoid using the above syntax. If necessary, you can create a simple program reading the standard input and writing everything to a chosen file (but with an appropriate check with the "fsync" command) and then use "application | mysaver outcome.txt", where "mysaver" is the name of your writing program instead of "application > outcome.txt".
Please note that the problem discussed above is in no way exceptional and does not stem directly from the characteristics of MooseFS itself. It may affect any system of files – only that network type systems are more prone to such difficulties. Technically speaking, the above recommendations should be followed at all times (also in case of classic file systems). 
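
A small shell-level illustration of the advice above (my example, not from the FAQ): instead of relying on a bare redirect, pipe through a writer that fsyncs and whose exit status is checked, e.g. dd with conv=fsync:

application | dd of=/mnt/mfs/outcome.txt bs=1M conv=fsync || echo "write to MooseFS failed" >&2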
