
10 Easy Ways to Improve ZFS Performance


Copyright © 2009, The e. Publishing Dept. of Morpho Studio (Spruce Int. Found.® ) All rights reserved.

The most frequently asked question about ZFS is: "How can I improve ZFS performance?"
That's not to say ZFS performs poorly; any file system in use today slows down after it has been in service for a while. In fact, ZFS can be a very fast file system.

ZFS's strong self-healing properties and the nature of its underlying algorithms let you achieve better performance than most RAID controllers and RAID units, without needing an expensive hardware controller. That is why ZFS can be called the industry's first true RAID (Redundant Array of Inexpensive Disks) solution.

Most of the ZFS performance problems I see are ultimately rooted in incorrect assumptions about hardware performance, or simply in expectations that are unrealistic given the laws of physics.

So let's look at ten easy ways to improve ZFS performance that everyone can apply, without having to become a ZFS expert first.
For easier reading, here is a table of contents:

  • Filesystem Performance Basics
  • Performance Expectations, Goals and Strategy
  • #1: Add Enough RAM
  • #2: Add More RAM
  • #3: Add Even More RAM for Deduplication Performance
  • #4: Use SSDs to Improve Read Performance
  • #5: Use SSDs to Improve Write Performance
  • #6: Use Mirroring
  • #7: Add More Disks
  • #8: Leave Enough Free Space
  • #9: Hire a ZFS Expert
  • #10: Be an Evil Tuner, But Know What You Do
  • Bonus: Some Miscellaneous Settings
  • Your Turn
  • Related Posts

Before we dive into the performance tips, let's review some basics:

Filesystem Performance Basics

It is important to distinguish between the two basic kinds of file system operations: reads and writes.

This may sound simple, even silly, but bear with me: reads and writes go through very different data paths in the file system's I/O stack, which means the ways of improving read performance differ from those for improving write performance.

You can use the zpool iostat or iostat(1M) commands to check whether the system's read or write performance actually matches your expectations.
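For example, a quick way to check reads and writes separately is zpool iostat with an interval argument; a minimal sketch, assuming a pool named tank (the pool name is a placeholder):

    # Per-vdev read/write operations and bandwidth, refreshed every 5 seconds
    zpool iostat -v tank 5

    # The system-wide view from iostat(1M), for comparison
    iostat -xn 5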

Next, we need to understand the two kinds of file system performance metrics:

  • Bandwidth: measured in MB/s (or GB/s if you're lucky), this tells you how much data passes through the file system (read or written) per unit of time.
  • IOPS: the number of I/O operations per second.

Again, these different performance perspectives call for different optimizations, so you need to know which particular problem you are facing. In addition, both reads and writes come in two different patterns:

  • Sequential: predictable, blocks stored contiguously (next to each other)
  • Random: unpredictable, out of order, blocks that are scattered and hard to access back-to-back

The good news: through the magic of copy-on-write, ZFS automatically turns random writes into sequential writes. That is one class of performance problem that few other file systems take care of for you.

Finally, when it comes to write I/Os, you should understand the difference between two kinds of operations:

  • Synchronous writes: the write is not considered complete until the data has been committed to stable storage (such as disk). In ZFS these writes go through the ZFS Intent Log, or ZIL. They are most common on file servers and database servers, and they are highly sensitive to disk latency and IOPS (see the sketch after this list).
  • Asynchronous writes: the write call returns as soon as the data has been cached in memory, before it has been committed to disk. Performance is easy to obtain this way, but it comes at the cost of reliability: if the system loses power before the data has actually been written out in the background, data can be lost, or worse, you can run into problems such as the RAID-5 write hole (an interrupted stripe write leaves the parity inconsistent, which is why reliability-critical setups need expensive battery-backed solutions).
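Newer ZFS releases also expose a per-dataset sync property that controls how synchronous write requests are honored; it does not exist on older versions, so treat the following as a hedged sketch (tank/db is a placeholder dataset):

    # How are synchronous writes handled for this dataset?
    zfs get sync tank/db
    # standard = honor application sync requests (default)
    # always   = treat every write as synchronous
    # disabled = ignore sync requests entirely (fast, but unsafe on power loss)
    zfs set sync=standard tank/db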

Performance Expectations, Goals and Strategy
We're almost ready to get to today's performance tips, but before we start, we need to get a few concepts straight:

*Set realistic expectations: ZFS is great, yes. But you still have to obey the laws of physics. A single 10,000 rpm disk cannot deliver more than about 166 random IOPS, because 10,000 revolutions per minute divided by 60 seconds per minute is 166. That means the head can position itself over a random block at most 166 times per second. Any more seeks than that and your reads/writes are not actually random anymore. This is how the theoretical maximum for a disk's random read/write IOPS is derived.

Similarly, RAID-Z means that for each RAID-Z group of disks you only get the IOPS performance of a single disk, because every file system I/O is issued in parallel to all of the disks in that RAID-Z group.

Know the physical limits of your storage devices, and the performance you can realistically expect from them, whenever you analyze your performance and define your performance goals.

*Set performance goals:
What exactly counts as "too slow"? What level of performance would be acceptable? How much performance are you getting now, and how much do you want?

Setting performance goals is important because they tell you when you are done. There is always a way to squeeze out more performance, but improving it at any cost is pointless. Know when you are done, then celebrate!

*Be systematic: We try this, then we try that; we measure with cp(1) even though our application is really a database. Then we tweak parameters here and there, and usually, before we know what is going on, we realize: we really don't know anything.

Being systematic means deciding how to measure the performance we care about, establishing the system's current baseline, using a measurement method that relates directly to the real application we are interested in, and sticking to that same method throughout the performance analysis and optimization process.

Otherwise things get confusing, we lose track, and we can no longer tell whether we have reached our goal.

Now that we understand what kind of performance improvement we are after, we know what we can realistically expect from today's hardware, we have set some concrete goals, and we have a methodical approach to performance optimization, let's get to today's tips:

 

#1: Add Enough RAM

A small portion of the space on your disks is used to store ZFS metadata. This is data that ZFS itself needs in order to know where the actual user data lives on disk. In other words, this metadata is the roadmap and data structure ZFS uses to find your user data.

If your server doesn't have enough RAM to cache this metadata, extra metadata read I/Os are needed to figure out where on disk each piece of data you want to read actually lives. That makes user data reads slower, and you should avoid it. If the RAM you actually have available is very small, the impact on performance can be severe.

How much RAM do you need? As a rule of thumb, take the total size of your disks (in GB), divide by 1000, then add 1 GB for the operating system. In other words, for every 1 TB of data plan on at least 1 GB of RAM for caching ZFS metadata, plus whatever additional memory the OS and your applications need.

Having enough RAM benefits your reads whether they are random or sequential, simply because the metadata is cached in memory and can be found much faster than by going to disk. So make sure your system has at least n/1000 + 1 GB of RAM, where n is your storage pool capacity in GB.
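As a quick back-of-the-envelope check, you can read the pool size from zpool list and apply the rule of thumb by hand; the pool name and numbers below are only an example:

    # Total pool size as reported by zpool list (say it prints 12T)
    zpool list -H -o size tank
    # Rule of thumb: 12 TB / 1000 ~= 12 GB for ZFS metadata caching, plus 1 GB for the OS
    # => at least 13 GB of RAM, before counting application memory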

 

#2: Add More RAM

ZFS will use every bit of RAM it can find to cache data. It has a very sophisticated caching algorithm that tries to cache both the most recently used and the most frequently used data, adaptively balancing between the two based on how the data is actually used. ZFS also has advanced prefetching logic that can greatly improve sequential read performance for many kinds of workloads.

The more RAM you give ZFS, the better all of this works. But how do you know whether more RAM will bring a breakthrough in performance or only a small improvement?

That depends on the size of your working set.

Your working set is the part of your data that you use most often: the contents of the main production/web/e-commerce database running on the system, the clients in your hosting environment that generate the most traffic, the files you access most frequently, and so on.

If your working set fits into RAM, then most of the time the bulk of your read requests can be served from memory, without generating I/Os to slow disks.

Try to estimate the size of your most frequently used data, then add enough RAM to your ZFS server so it can stay resident in memory; that will give you the biggest read performance win.

If you want a more automated way to figure this out, Ben Rockwood has written an excellent tool called arc_summary (ARC stands for the ZFS Adaptive Replacement Cache). Its two "ghost" statistics tell you, based on the data load of the recent past, exactly how much more RAM would have helped to noticeably improve your ZFS performance.
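arc_summary gets its numbers from the ARC kstats, so if you only want the raw ghost-list counters you can query them directly on Solaris-derived systems; this is a sketch, and the interpretation is the same as in arc_summary: high ghost hit counts mean reads that extra RAM would have served from cache:

    # The ghost lists track blocks recently evicted from the ARC;
    # hits on them are cache misses that more RAM would have avoided
    kstat -p zfs:0:arcstats:mru_ghost_hits zfs:0:arcstats:mfu_ghost_hits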

If you want to influence the balance between user data and metadata in the ZFS ARC cache, check out the primarycache filesystem property that you can set using the zfs(1M) command. For RAM-starved servers with a lot of random reads, it may make sense to restrict the precious RAM cache to metadata and use an L2ARC, explained in tip #4 below.

#3: Add Even More RAM for Deduplication Performance

In an earlier article I covered the basics of ZFS deduplication. If you plan to use this feature, keep in mind that ZFS maintains a table with the location and checksum of every block stored in the file system, so it can determine whether a particular block has already been written and safely mark it as a duplicate.

Deduplication saves storage space, and it can also improve ZFS performance by saving unnecessary read and write IOPS. The cost, however, is that you need extra RAM to hold the dedup table (the ZFS DDT); otherwise the additional I/Os to slow disks will actually hurt file system performance.

So how big is the ZFS dedup table? Richard Elling pointed out in a recent post that the dedup table has one entry per block, and each entry uses roughly 250 bytes. With a block size of 8K, every 1 TB of user data would need about 32 GB of RAM to hold the table. If you mostly store large files, your average block size will be larger, say 64K, and then about 4 GB of RAM is enough for the whole dedup table.
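Here is the arithmetic behind those numbers, plus a hedged way to estimate your own dedup table size before turning dedup on (zdb -S simulates deduplication on an existing pool; the pool name tank is a placeholder):

    # ~250 bytes per dedup table entry (Richard Elling's estimate)
    # 1 TB / 8 KB blocks  = 134,217,728 entries x 250 B ~= 32 GB of dedup table
    # 1 TB / 64 KB blocks =  16,777,216 entries x 250 B ~=  4 GB of dedup table

    # Simulate dedup on an existing pool and report the expected table size
    zdb -S tank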

If you don't have enough RAM, don't use ZFS deduplication; otherwise the extra disk I/O it causes will reduce ZFS performance instead of improving it.

#4: Use SSDs to Improve Read Performance

If you can't add any more RAM to your server (or your purchasing department won't approve the budget), the next best way to improve read performance is to add flash-based SSDs as a second-level ARC cache (L2ARC).

This is easily configured with the zpool(1M) command; see the "Cache devices" section of its man page.
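Adding a cache device is a one-liner; a minimal sketch, assuming the SSD shows up as c4t0d0 (a placeholder device name):

    # Add an SSD as an L2ARC cache device to the pool
    zpool add tank cache c4t0d0
    # zpool iostat -v tank will then list the cache device and show how it fills up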

SSDs can deliver two orders of magnitude better IOPS than traditional harddisks, and they're much cheaper on a per-GB basis than RAM.
They form an excellent layer of cache between the ZFS RAM-based ARC and the actual disk storage.

You don't need to observe any reliability requirements when configuring L2ARC devices: If they fail, no data is lost because it can always be retrieved from disk.

This means that L2ARC devices can be cheap, but before you start putting USB sticks into your server, you should make sure they deliver a good performance benefit over your rotating disks :).

SSDs come in various sizes: From drop-in-replacements for existing SATA disks in the range of 32GB to the Oracle Sun F20 PCI card with 96GB of flash and built-in SAS controllers (which is one of the secrets behind Oracle Exadata V2's breakthrough performance), to the mighty fast Oracle Sun F5100 flash array (which is the secret behind Oracle's current TPC-C and other world records) with a whopping 1.96TB of pure flash memory and over a million IOPS. Nice!

And since the dedup table is stored in the ZFS ARC and consequently spills off into the L2ARC if available, using SSDs as cache devices will also benefit deduplication performance.

#5: Use SSDs to Improve Write Performance

Most write performance problems are related to synchronous writes. These are mostly found in file servers and database servers.

With synchronous writes, ZFS needs to wait until each particular IO is written to stable storage, and if that's your disk, then it'll need to wait until the rotating rust has spun into the right place, the harddisk's arm moved to the right position, and finally, until the block has been written. This is mechanical, it's latency-bound, it's slow.

See Roch's excellent article on ZFS NFS performance for a more detailed discussion on this.

SSDs can change the whole game for synchronous writes because they have 100x better latency: No moving parts, no waiting, instant writes, instant performance.

So if you're suffering from a high load in synchronous writes, add SSDs as ZFS log devices (aka ZIL, Logzillas) and watch your synchronous writes fly. Check out the zpool(1M) man page under the "Intent Log" section for more details.

Make sure you mirror your ZIL devices: They are there to guarantee the POSIX requirement for "stable storage" so they must function reliably, otherwise data may be lost on power or system failure.
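A minimal sketch of adding a mirrored pair of SSDs as log devices (pool and device names are placeholders):

    # Add two SSDs as a mirrored ZFS Intent Log (slog)
    zpool add tank log mirror c4t1d0 c4t2d0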

Also, make sure you use high quality SLC Flash Memory devices, because they can give you reliable write transactions. Cheaper MLC cells can damage existing data if the power fails during write operations, something you really don't want.

#6: Use Mirroring

Many people configure their storage for maximum capacity. They just look at how many TB they can get out of their system. After all, storage is expensive, isn't it?

Wrong. Storage capacity is cheap. Every 18 months or so, the same disk only costs half as much, or you can buy double the capacity for the same price, depending on how you view it.

But storage performance can be precious. So why squeeze the last GB out of your storage if capacity is cheap anyway? Wouldn't it make more sense to trade in capacity for speed?

This is what mirroring disks offer as opposed to RAID-Z or RAID-Z2:

  • RAID-Z(2) groups several disks into a RAID group, called vdevs. This means that every I/O operation at the file system level is going to be translated into a parallel group of I/O operations to all of the disks in the same vdev.
    The result: Each RAID group can only deliver the IOPS performance of a single disk, because the transaction always has to wait until all of the disks in the same vdev are finished.
    This is both true for reads and for writes: The whole pool can only deliver as many IOPS as the total number of striped vdevs times the IOPS of a single disk.
    There are cases where the total bandwidth of RAID-Z can take advantage of the aggregate performance of all drives in parallel, but if you're reading this, you're probably not seeing such a case.
  • Mirroring behaves differently: For writes, the rules are the same: Each mirrored pair of disks will deliver the write IOPS of a single disk, because each write transaction will need to wait until it has completed on both disks. But a mirrored pair of disks is a much smaller granularity than your typical RAID-Z set (with up to 10 disks per vdev). For 20 disks, this could be the difference between 10x the IOPS of a disk in the mirror case vs. only 2x the IOPS of a disk in a wide stripes RAID-Z2 scenario (8+2 disks per RAID-Z2 vdev). A 5x performance difference!
    For reads, the difference is even bigger: ZFS will round-robin across all of the disks when reading from mirrors. This will give you 20x the IOPS of a single disk in a 20 disk scenario, but still only 2x if you use wide stripes of the 8+2 kind.
    Of course, the numbers can change when using smaller RAID-Z stripes, but the basic rules are the same and the best performance is always achieved with mirroring.

For a more detailed discussion on this, I highly recommend Richard Elling's post on ZFS RAID recommendations: Space, performance and MTTDL.

Also, there's some more discussion on this in my earlier RAID-GREED-article.

Bottom line: If you want performance, use mirroring.
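To make the difference concrete, here is what the two layouts look like at pool creation time for 20 disks; all device names are placeholders and this is only a sketch of the idea, not a sizing recommendation:

    # 20 disks as 10 mirrored pairs:
    # ~10x the write IOPS and up to ~20x the read IOPS of a single disk
    zpool create tank \
        mirror c1t0d0 c1t1d0  mirror c1t2d0 c1t3d0  mirror c1t4d0 c1t5d0 \
        mirror c1t6d0 c1t7d0  mirror c1t8d0 c1t9d0  mirror c2t0d0 c2t1d0 \
        mirror c2t2d0 c2t3d0  mirror c2t4d0 c2t5d0  mirror c2t6d0 c2t7d0 \
        mirror c2t8d0 c2t9d0

    # The same 20 disks as two 8+2 RAID-Z2 vdevs: only ~2x the IOPS of a single disk
    # zpool create tank raidz2 <disk1 ... disk10> raidz2 <disk11 ... disk20>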

#7: Add More Disks

Our next tip was already buried inside tip #6: Add more disks. The more vdevs ZFS has to play with, the more shoulders it can place its load on and the faster your storage performance will become.

This works both for increasing IOPS and for increasing bandwidth, and it'll also add to your storage space, so there's nothing to lose by adding more disks to your pool.

But keep in mind that the performance benefit of adding more disks (and of using mirrors instead of RAID-Z(2)) only accelerates aggregate performance. The performance of every single I/O operation is still confined to that of a single disk's I/O performance.

So, adding more disks does not substitute for adding SSDs or RAM, but it'll certainly help aggregate IOPS and bandwidth for the cases where lots of concurrent IOPS and bigger overall bandwidth are needed.
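Growing an existing pool is a single command as well; a sketch that adds one more mirrored pair (placeholder device names):

    # Add another mirrored vdev: more aggregate IOPS, more bandwidth, more space
    zpool add tank mirror c3t0d0 c3t1d0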

#8: Leave Enough Free Space

Don't wait until your pool is full before adding new disks, though.

ZFS uses copy-on-write, which means that it writes new data into free blocks; the new state only becomes valid once the überblock has been updated.

This is great for performance because it gives ZFS the opportunity to turn random writes into sequential writes - by choosing the right blocks out of the list of free blocks so they're nicely in order and thus can be written to quickly.

That is, when there are enough blocks.

Because if you don't have enough free blocks in your pool, ZFS will be limited in its choice, and that means it won't be able to choose enough blocks that are in order, and hence it won't be able to create an optimal set of sequential writes, which will impact write performance.

As a rule of thumb, don't let your pool become more full than about 80% of its capacity. Once it reaches that point, you should start adding more disks so ZFS has enough free blocks to choose from in sequential write order.
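You can keep an eye on this with zpool list (the pool name is a placeholder); the CAP column shows how full the pool is:

    # Watch the CAP column and start adding disks well before it passes ~80%
    zpool list tank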

#9: Hire A ZFS Expert

There's a reason why this point comes up almost last: In the utter majority of all ZFS performance cases, one or more of #1-#8 above are almost always the solution.

And they're cheaper than hiring a ZFS performance expert who will likely tell you to add more RAM, or add SSDs or switch from RAID-Z to mirroring after looking at your configuration for a couple of minutes anyway!

But sometimes, a performance problem can be really tricky. You may think it's a storage performance problem, but instead your application may be suffering from an entirely different effect.

Or maybe there are some complex dependencies going on, or some other unusual interaction between CPUs, memory, networking, I/O and storage.

Or perhaps you're hitting a bug or some other strange phenomenon?

So, if all else fails and none of the above options seem to help, contact your favorite Oracle/Sun representative (or send me a mail) and ask for a performance workshop quote.
If your performance problem is really that hard, we want to know about it.

#10: Be An Evil Tuner - But Know What You Do

If you don't want to go for option #9 and if you know what you do, you can check out the ZFS Evil Tuning Guide.

There's a reason it's called "evil": ZFS is not supposed to be tuned. The default values are almost always the right values, and most of the time, changing them won't help, unless you really know what you're doing. So, handle with care.

Still, when people encounter a ZFS performance problem, they tend to Google "ZFS tuning", then they'll find the Evil Tuning Guide, then think that performance is just a matter of setting that magic variable in /etc/system.

This is simply not true.

Measuring performance in a standardized way, setting goals, then sticking to them helps. Adding RAM helps. Using SSDs helps. Thinking about the right number and RAID level of disks helps. Letting ZFS breathe helps.

But tuning kernel parameters is reserved for very special cases, and then you're probably much better off hiring an expert to help you do that correctly.
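Purely for illustration (not a recommendation): the classic example of such a kernel tunable is capping the ARC size in /etc/system, for instance when ZFS has to share a machine with a memory-hungry database. The value below is a made-up placeholder:

    * /etc/system: limit the ZFS ARC to 4 GB (0x100000000 bytes); takes effect after a reboot
    set zfs:zfs_arc_max = 0x100000000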

Bonus: Some Miscellaneous Settings

If you look through the zfs(1M) man page, you'll notice a few performance related properties you can set.
They're not general cures for all performance problems (otherwise they'd be set by default), but they can help in specific situations. Here are a few:

  • atime: This property controls whether ZFS records the time of last access for reads. Switching this to off will save you extra write IOs when reading data. This can have a big impact if your application doesn't care about the time of last access for a file and if you have a lot of small files that need to be read frequently.
  • checksum and compression can be double-edged swords: The stronger the checksum, the better your data is protected against corruption (and this is even more important when using dedup). But a stronger checksum method will incur some more load on the CPU for both reading and writing.
    Similarly, using compression may save a lot of IOPS if the data can be compressed well, but may be in the way for data that isn't easily compressed. Again, compression costs some extra CPU time.
    Keep an eye on CPU load while running tests and if you find that your CPU is under heavy load, you might need to tweak one of these.
  • recordsize: Don't change this property unless you're running a database on this filesystem. ZFS automatically figures out the best block size for your filesystems.
    In case you're running a database (where the file may be big, but the access pattern is always in fixed-size chunks), setting this property to your database record size may help performance a lot (see the combined example after this list).
  • primarycache and secondarycache: We already introduced the primarycache property in tip #2 above. It controls whether your precious RAM cache should be used for metadata or for both metadata and user data. In cases where you have an SSD configured as a cache device and if you're using a large filesystem, it may help to set primarycache=metadata so the RAM is used for metadata only.
    secondarycache does the same for cache devices, but it should be used to cache metadata only in cases where you have really big file systems and almost no real benefit from caching data.
  • logbias: When executing synchronous writes, there's a tradeoff to be made: Do you want to wait a little, so you can accumulate more synchronous write requests to be written into the log at once, or do you want to service each individual synchronous write as fast as possible, at the expense of throughput?
    This property lets you decide which side of the tradeoff you want to favor.
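A hedged set of examples showing how these properties are applied with zfs(1M); the dataset names and values are placeholders, not recommendations:

    # Stop recording access times to save write I/Os on read-mostly data
    zfs set atime=off tank/files

    # Lightweight compression trades a little CPU for fewer I/Os on compressible data
    zfs set compression=on tank/files

    # Match the record size to the database block size (e.g. 8K), on the DB filesystem only
    zfs set recordsize=8k tank/db

    # Keep precious RAM for metadata and let an L2ARC device hold user data
    zfs set primarycache=metadata tank/bigfs
    zfs set secondarycache=all tank/bigfs

    # Favor low latency (the default) or overall throughput for synchronous writes
    zfs set logbias=latency tank/db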

Your Turn

Sorry for the long article. I hope the table of contents at the beginning makes it more digestible, and I hope it's useful to you as a little checklist for ZFS performance planning and for dealing with ZFS performance problems.

Let me know if you want me to split up longer articles like these (though this one is really meant to remain together).

Now it's your turn: What is your experience with ZFS performance? What options from the above list did you implement for what kind of application/problem and what were your results? What helped and what didn't and what are your own ZFS performance secrets?

Share your ZFS performance expertise in the comments section and help others get the best performance out of ZFS!

Related Posts
