Open-source cloud storage: a Gluster learning journey -- installation, volume creation, and client mounting

On duty over the National Day holiday!!! A good time to write up the Gluster setup I worked on last week.
----------------------------------------- Preface ----------------------------------------

What is Gluster?
Gluster is a scalable distributed file system that aggregates disk storage resources from multiple servers into a single global namespace.

Advantages
Scales to several petabytes
Handles thousands of clients
POSIX compatible
Uses commodity hardware
Can use any on-disk filesystem that supports extended attributes
Accessible using industry standard protocols like NFS and SMB
Provides replication, quotas, geo-replication, snapshots and bitrot detection
Allows optimization for different workloads

Open Source

Quick installation steps:

Step 1 – Have at least three nodes
Fedora 26 (or later) on 3 nodes named "server1", "server2" and "server3"
A working network connection
At least two virtual disks, one for the OS installation, and one to be used to serve GlusterFS storage (sdb), on each of these VMs. This will emulate a real-world deployment, where you would want to separate GlusterFS storage from the OS install.
Set up NTP on each of these servers so that the applications running on top of the filesystem behave correctly (a minimal example follows the note below).
Note: GlusterFS stores its dynamically generated configuration files at /var/lib/glusterd. If at any point in time GlusterFS is unable to write to these files (for example, when the backing filesystem is full), it will at minimum cause erratic behavior for your system; or worse, take your system offline completely. It is recommended to create separate partitions for directories such as /var/log to reduce the chances of this happening.
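Setting up time synchronization can be as simple as the following minimal sketch (assuming a CentOS/Fedora host where chrony is available; ntpd works just as well):

yum -y install chrony  #or the ntp package, depending on the distribution
systemctl enable chronyd  #enable the time service at boot
systemctl start chronyd  #start it now
chronyc sources  #verify that time sources are reachable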

Step 2 - Format and mount the bricks
Perform this step on all the nodes, "server{1,2,3}"

Note: We are going to use the XFS filesystem for the backend bricks, but Gluster is designed to work on top of any filesystem that supports extended attributes.

The following examples assume that the brick will be residing on /dev/sdb1.

mkfs.xfs -i size=512 /dev/sdb1
mkdir -p /data/brick1
echo '/dev/sdb1 /data/brick1 xfs defaults 1 2' >> /etc/fstab
mount -a && mount
You should now see sdb1 mounted at /data/brick1
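To double-check the brick mount before moving on (a quick sketch based on the layout above):

df -hT /data/brick1  #should list /dev/sdb1 with filesystem type xfs mounted on /data/brick1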

Step 3 - Installing GlusterFS
Install the software

yum install glusterfs-server
Start the GlusterFS management daemon:

service glusterd start
service glusterd status

glusterd.service - LSB: glusterfs server
   Loaded: loaded (/etc/rc.d/init.d/glusterd)
   Active: active (running) since Mon, 13 Aug 2012 13:02:11 -0700; 2s ago
  Process: 19254 ExecStart=/etc/rc.d/init.d/glusterd start (code=exited, status=0/SUCCESS)
   CGroup: name=systemd:/system/glusterd.service
       ├ 19260 /usr/sbin/glusterd -p /run/glusterd.pid
       ├ 19304 /usr/sbin/glusterfsd --xlator-option georep-server.listen-port=24009 -s localhost...
       └ 19309 /usr/sbin/glusterfs -f /var/lib/glusterd/nfs/nfs-server.vol -p /var/lib/glusterd/...
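On systemd-based distributions the same can be done with systemctl (the service command above is usually just redirected to it):

systemctl enable glusterd  #enable the management daemon at boot
systemctl start glusterd  #start it now
systemctl status glusterd  #should report active (running)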
Step 4 - Configure the firewall
The gluster processes on the nodes need to be able to communicate with each other. To simplify this setup, configure the firewall on each node to accept all traffic from the other nodes.

iptables -I INPUT -p all -s <ip-address> -j ACCEPT
where ip-address is the address of the other node.
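If firewalld is in use instead of raw iptables, a rough equivalent (assuming the commonly documented Gluster ports: 24007-24008 for management and one port per brick from 49152 upward) would be:

firewall-cmd --permanent --add-port=24007-24008/tcp  #glusterd / management traffic
firewall-cmd --permanent --add-port=49152-49160/tcp  #brick ports; widen the range to cover one port per brick
firewall-cmd --reload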

Step 5 - Configure the trusted pool
From "server1"

gluster peer probe server2
gluster peer probe server3
Note: When using hostnames, the first server needs to be probed from one other server to set its hostname.

From "server2"

gluster peer probe server1
Note: Once this pool has been established, only trusted members may probe new servers into the pool. A new server cannot probe the pool, it must be probed from the pool.

Check the peer status on server1

gluster peer status
You should see something like this (the UUID will differ)

Number of Peers: 2

Hostname: server2
Uuid: f0e7b138-4874-4bc0-ab91-54f20c7068b4
State: Peer in Cluster (Connected)

Hostname: server3
Uuid: f0e7b138-4532-4bc0-ab91-54f20c701241
State: Peer in Cluster (Connected)
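A more compact view of the trusted pool is also available (output formatting varies by version):

gluster pool list  #prints UUID, hostname and connection state for every peer, including localhost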
Step 6 - Set up a GlusterFS volume
On all servers:

mkdir -p /data/brick1/gv0
From any single server:

gluster volume create gv0 replica 3 server1:/data/brick1/gv0 server2:/data/brick1/gv0 server3:/data/brick1/gv0
gluster volume start gv0
Confirm that the volume shows "Started":

gluster volume info
You should see something like this (the Volume ID will differ):

Volume Name: gv0
Type: Replicate
Volume ID: f25cc3d8-631f-41bd-96e1-3e22a4c6f71f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: server1:/data/brick1/gv0
Brick2: server2:/data/brick1/gv0
Brick3: server3:/data/brick1/gv0
Options Reconfigured:
transport.address-family: inet
Note: If the volume does not show "Started", check the files under /var/log/glusterfs/glusterd.log to debug and diagnose the situation. These logs can be inspected on one or all of the configured servers.
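Beyond the logs, the volume and its brick processes can be checked directly; a brief sketch using the volume created above:

gluster volume status gv0  #shows whether each brick process is online and which port it listens on
tail -n 50 /var/log/glusterfs/glusterd.log  #recent management-daemon messages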

Step 7 - Testing the GlusterFS volume
For this step, we will use one of the servers to mount the volume. Typically, you would do this from an external machine, known as a "client". Since using this method would require additional packages to be installed on the client machine, we will use one of the servers as a simple place to test first, as if it were that "client".

mount -t glusterfs server1:/gv0 /mnt
for i in `seq -w 1 100`; do cp -rp /var/log/messages /mnt/copy-test-$i; done
First, check the client mount point:

ls -lA /mnt/copy* | wc -l
You should see 100 files returned. Next, check the GlusterFS brick mount points on each server:

ls -lA /data/brick1/gv0/copy*
You should see 100 files on each server using the method we listed here. Without replication, in a distribute only volume (not detailed here), you should see about 33 files on each one.
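To make the comparison concrete, the copies on each brick can be counted directly (assuming the brick path used in this guide):

ls /data/brick1/gv0/copy* | wc -l  #expect 100 per server on this replica-3 volume; roughly 33 on a pure distribute volume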

When I installed it myself, building from source did not work out. On Linux I prefer installing RPM packages via yum, as follows:

Mind the yum repository when installing via yum:
https://buildlogs.centos.org/centos/7/storage/x86_64/gluster-6/

cat /etc/yum.repos.d/gluster.repo
[gluster]
name=gluster
baseurl=https://buildlogs.centos.org/centos/7/storage/x86_64/gluster-6/
gpgcheck=0
enabled=1
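With the repo file in place, a minimal install-and-start sequence looks roughly like this (installing centos-release-gluster from CentOS extras is an alternative way to get a maintained repo):

yum clean all && yum makecache  #pick up the new repository
yum -y install glusterfs-server  #install the server packages
systemctl enable glusterd  #enable the management daemon at boot
systemctl start glusterd  #start it now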

The URL may change as new releases are published; check https://download.gluster.org/pub/gluster/glusterfs/ for the current location.

Architecture overview:

The five GlusterFS volume types

Distributed: distributed volume; files are spread across the bricks by a hash algorithm.

Distributed GlusterFS volume: this is the default volume type; if no type is specified when the volume is created, a distributed volume is created. Files are distributed across the bricks in the volume, so a given file, say file1, is stored on brick1 or brick2 but never on both, and there is no data redundancy. The purpose of this volume type is to scale capacity easily and cheaply, but it also means that a brick failure causes complete loss of the data on that brick, and you must rely on the underlying hardware for protection against data loss.


gluster volume create test-volume server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4
Creation of test-volume has been successful
Please start the volume to access data

# gluster volume info
Volume Name: test-volume
Type: Distribute
Status: Created
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: server1:/exp1
Brick2: server2:/exp2
Brick3: server3:/exp3
Brick4: server4:/exp4

Replicated: replicated volume, similar to RAID 1; the replica count must equal the number of bricks in the volume, giving high availability.

Replicated GlusterFS volume: this type overcomes the data-loss problem of the distributed volume. An exact copy of the data is kept on every brick. The number of replicas is chosen when the volume is created, so at least two bricks are needed for a 2-replica volume and at least three for a 3-replica volume. The main advantage is that even if one brick fails, the data can still be accessed from its replica bricks. Such volumes are used for better reliability and data redundancy.


# gluster volume create test-volume replica 2 transport tcp server1:/exp1 server2:/exp2
Creation of test-volume has been successful
Please start the volume to access data

Striped: striped volume, similar to RAID 0; the stripe count must equal the number of bricks in the volume. Files are split into chunks and stored across the bricks in round-robin fashion; concurrency is at chunk granularity, which gives good performance for large files.
-- this type is rarely used in practice.

Distributed Striped: distributed striped volume; the number of brick servers in the volume must be a multiple (>= 2x) of the stripe count, combining the properties of distributed and striped volumes.

Distributed Replicated GlusterFS volume
A distributed replicated volume (Distributed Replicated Glusterfs Volume) is a combination of the distributed and replicated types and has the features of both:

Several bricks form one replicated volume while other bricks form further replicated volumes; within a replicated volume a single file is kept as identical copies, and different files are hash-distributed across the different replicated volumes, i.e. the distributed layer spans the replicated sets;
The number of brick servers is a multiple (>= 2x) of the replica count, so at least 4 brick servers are needed, and the bricks that make up a replica set should be of equal capacity.
"At least 4 brick servers" of course assumes one brick per server; if a brick server provides more than one brick, this minimum no longer applies.

In this volume type, files are distributed across replicated sets of bricks. The number of bricks must be a multiple of the replica count, and the order in which bricks are specified matters, because adjacent bricks become replicas of each other. This type is used when high data availability is needed along with scalable storage. So with eight bricks and a replica count of 2, the first two bricks become replicas of each other, then the next two, and so on; this volume is denoted 4x2. With eight bricks and a replica count of 4, four bricks become replicas of each other and the volume is denoted 2x4.
By planning this ordering carefully, the failure domains can be spread apart.
Creating a distributed replicated volume:

# gluster volume create NEW-VOLNAME [replica COUNT] [transport [tcp | rdma | tcp,rdma]] NEW-BRICK ...

For example, a four-node distributed (replicated) volume with two-way mirroring:

# gluster volume create test-volume replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4

Creation of test-volume has been successful
Please start the volume to access data

As you can see, it is really just a replicated volume with bricks added in multiples of the replica count.
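As a rough sketch of the 4x2 arrangement described above (the server and brick names here are hypothetical), adjacent bricks form a replica pair, so listing the bricks pair by pair keeps both copies of each pair on different hosts:

gluster volume create dist-repl-volume replica 2 transport tcp \
  server1:/exp1 server2:/exp2 \
  server3:/exp3 server4:/exp4 \
  server5:/exp5 server6:/exp6 \
  server7:/exp7 server8:/exp8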

About striped volumes:
In my view, both striped and distributed striped volumes are determined by the number of bricks, the stripe count, and the order of the bricks.

  1. Number of bricks = stripe count --> striped volume

gluster volume create stripe-volume stripe 3 transport tcp glusterfs01:/brick1/str_volume glusterfs02:/brick2/str_volume glusterfs03:/brick3/str_volume

This creates an ordinary striped volume, essentially a RAID 0: good performance but no redundancy, and a single brick failure leaves every file unreadable, so in practice it is rarely used.

  2. Number of bricks = a multiple of the stripe count --> distributed striped volume

# gluster volume create distributed-stripe-volume stripe 2 transport tcp glusterfs01:/brick1/dis_str_volume glusterfs02:/brick2/dis_str_volume glusterfs03:/brick3/dis_str_volume glusterfs04:/brick4/dis_str_volume

In addition, there are two more striped variants:
Striped mirrored volume (deprecated), also called striped replicated volume
A striped replicated volume (STRIPE REPLICA volume) is a combination of the striped and replicated types and has the features of both:
Several bricks form one replicated volume while other bricks form further replicated volumes; a single file is striped across 2 or more replicated sets, and within each replica set the file chunks are kept as copies, roughly equivalent to file-level RAID 01.
The number of brick servers is a multiple (>= 2x) of the replica count.
The create command should be something like:
# gluster volume create distributed-stripe-volume stripe 2 replica 2 transport tcp \
  glusterfs01:/brick1/dis_str_volume glusterfs02:/brick2/dis_str_volume \
  glusterfs03:/brick3/dis_str_volume glusterfs04:/brick4/dis_str_volume
Distributed striped replicated volume
A distributed striped replicated volume (DISTRIBUTE STRIPE REPLICA VOLUME) is a combination of the distributed, striped and replicated types and has the features of all three:
Multiple files are hash-distributed across multiple stripe sets; within a stripe set, a single file is striped across 2 or more replicated sets, and within each replica set the file chunks are kept as copies.

The number of brick servers is a multiple (>= 2x) of the replica count.

The create command should be something like:

gluster volume create distributed-stripe-volume stripe 2 replica 2 transport tcp \
  glusterfs01:/brick1/dis_str_volume glusterfs02:/brick2/dis_str_volume \
  glusterfs03:/brick3/dis_str_volume glusterfs04:/brick4/dis_str_volume \
  glusterfs05:/brick5/dis_str_volume glusterfs06:/brick6/dis_str_volume \
  glusterfs07:/brick7/dis_str_volume glusterfs08:/brick8/dis_str_volume

AFR recovery principles
Data recovery (self-heal) applies only to replicated volumes. AFR repair covers three aspects: ENTRY, METADATA and DATA.
The record describing the state of a replica is called the ChangeLog; it is stored in each replica file's extended attributes, read into memory and evaluated as a matrix to decide whether repair is needed and which replica should be the source of the repair. The initial (healthy) value is 0 (ENTRY, METADATA and DATA each have their own counter).
Take DATA repair with redundancy 2, i.e. two replicas A and B; a write breaks down into the following steps:
Issue the write operation;
Take the lock (Lock);
Increment the ChangeLog of replicas A and B by 1 and record it in each replica's extended attributes;
Perform the write on replicas A and B;
If a replica's write succeeds, decrement its ChangeLog by 1; if the write fails, leave the ChangeLog unchanged; record the result in each replica's extended attributes;

Release the lock (UnLock);
Return to the caller; success is reported as long as at least one replica was written successfully.
The above is a single complete transaction in AFR. Based on the ChangeLog values recorded by the two replicas, each replica falls into one of the following states:
WISE: this replica's ChangeLog value is 0 while the other replica's value is greater than 0;
INNOCENT: both replicas' ChangeLog values are 0;
FOOL: this replica's ChangeLog value is greater than 0 while the other replica's value is 0;
IGNORANT: this replica's ChangeLog is missing.
Recovery falls into the following scenarios:
One node's changelog state is WISE and the others are FOOL or some other non-WISE state: the WISE node is used to repair the others;
All nodes are in the IGNORANT state: trigger heal manually; the command uses the file with the smallest UID as the source to repair the other, zero-sized files (the heal commands are sketched after the link below);
Multiple nodes are WISE: this is split-brain. Split-brain files usually cannot be read and return "Input/Output error"; see the log /var/log/glusterfs/glustershd.log.
Split-brain causes and resolutions: https://docs.gluster.org/en/latest/Administrator Guide/Split brain and ways to deal with it/
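Self-heal can be triggered and inspected from the command line; a brief sketch using the gv0 volume from earlier:

gluster volume heal gv0  #trigger healing of entries that need it
gluster volume heal gv0 info  #list entries still pending heal
gluster volume heal gv0 info split-brain  #list entries currently in split-brain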

To inspect a replica file's extended attributes:

getfattr -m . -d -e hex [filename]

The "trusted.afr.xxx" attributes are the changelog; the value is 12 bytes (24 hex digits) in three parts, recording the DATA, METADATA and ENTRY changelogs in that order.

[root@glusterfs01 ~]# getfattr -m . -d -e hex /brick1/repl_volume/replica1.txt
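The output usually looks something like the following (purely illustrative values; the attribute names depend on the volume name and client index):

# file: brick1/repl_volume/replica1.txt
trusted.afr.repl_volume-client-0=0x000000000000000000000000  #no pending data/metadata/entry operations
trusted.afr.repl_volume-client-1=0x000000020000000000000000  #non-zero data counter: the other replica is stale
trusted.gfid=0x1a2b3c4d5e6f47089abcdef012345678  #hypothetical GFID, shown for completeness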

Client mounting

Production mount example:
node3.hfvast.com:/HFCloud on /var/lib/one/datastores type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

Manual mount:
mount.glusterfs node3.hfvast.com:/HFCloud /var/lib/one/datastores

Automatic mount via /etc/fstab:
192.168.56.11:/gv1 /mnt/glusterfs glusterfs defaults,_netdev 0 0
  The Gluster Native Client can be used on GNU/Linux clients for high concurrency, good performance and transparent failover. Gluster volumes can also be accessed over NFSv3; the NFS implementations in GNU/Linux and other operating systems, such as FreeBSD, Mac OS X, Windows 7 (Professional and up) and Windows Server 2003, have been tested extensively, and other NFS client implementations can work with the Gluster NFS server. With Microsoft Windows or Samba clients, volumes can be accessed via CIFS; for this access method the Samba packages must be installed on the client.
  Summary: GlusterFS supports three client types: Gluster Native Client, NFS and CIFS. The Gluster Native Client is a FUSE-based client that runs in user space; it is the officially recommended client and exposes the full functionality of GlusterFS.
1. Mounting with the Gluster Native Client
The Gluster Native Client is FUSE-based, so make sure FUSE is available on the client. It is the officially recommended client, supporting high concurrency and efficient writes.
Before installing the Gluster Native Client, verify that the FUSE module is loaded on the client and that the required modules are accessible:

[root@localhost ~]# modprobe fuse  #load the FUSE loadable kernel module (LKM) into the Linux kernel
[root@localhost ~]# dmesg | grep -i fuse  #verify that the FUSE module is loaded
[ 569.630373] fuse init (API version 7.22)
Install the Gluster Native Client:

[root@localhost ~]# yum -y install glusterfs-client  #install the glusterfs client packages
[root@localhost ~]# mkdir /mnt/glusterfs  #create the mount directory
[root@localhost ~]# mount.glusterfs 192.168.56.11:/gv1 /mnt/glusterfs/  #mount the gv1 volume
[root@localhost ~]# df -h
Filesystem          Size  Used Avail Use% Mounted on
/dev/sda2            20G  1.4G   19G   7% /
devtmpfs            231M     0  231M   0% /dev
tmpfs               241M     0  241M   0% /dev/shm
tmpfs               241M  4.6M  236M   2% /run
tmpfs               241M     0  241M   0% /sys/fs/cgroup
/dev/sda1           197M   97M  100M  50% /boot
tmpfs                49M     0   49M   0% /run/user/0
192.168.56.11:/gv1  4.0G  312M  3.7G   8% /mnt/glusterfs
[root@localhost ~]# ll /mnt/glusterfs/  #check the contents of the mount point
total 100000
-rw-r--r-- 1 root root 102400000 Aug 7 04:30 100M.file
[root@localhost ~]# mount  #check the mount information
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
...
192.168.56.11:/gv1 on /mnt/glusterfs type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

=================================================

Manual mount options:
When using the mount -t glusterfs command, the following options can be specified. Note that multiple options must be separated by commas.

backupvolfile-server=server-name  #if this option is given when mounting the fuse client, the server named here is used as the volfile server when the first volfile server fails
volfile-max-fetch-attempts=number-of-attempts  #number of attempts made to fetch the volume file while mounting the volume
log-level=loglevel  #log level
log-file=logfile  #log file
transport=transport-type  #transport protocol
direct-io-mode=[enable|disable]
use-readdirp=[yes|no]  #when set to yes, forces the use of readdirp mode in the fuse kernel module

For example:

mount -t glusterfs -o backupvolfile-server=volfile_server2,use-readdirp=no,volfile-max-fetch-attempts=2,log-level=WARNING,log-file=/var/log/gluster.log server1:/test-volume /mnt/glusterfs
Automatic mounting:
Besides mounting with the mount command, volumes can also be mounted automatically via /etc/fstab.
Syntax: HOSTNAME-OR-IPADDRESS:/VOLNAME MOUNTDIR glusterfs defaults,_netdev 0 0
For example:
192.168.56.11:/gv1 /mnt/glusterfs glusterfs defaults,_netdev 0 0
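After editing /etc/fstab the entry can be verified without a reboot (a quick sketch):

mount -a  #mount everything listed in /etc/fstab that is not yet mounted
df -hT /mnt/glusterfs  #should show the volume with type fuse.glusterfs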

I'll put the administration and maintenance notes together another day, heh~
