[Translation] Linus politely criticizes a developer over spinlocks

I heard that Linus recently gave a developer a patient and polite dressing-down. The original post is here: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723

I found it interesting enough to translate the whole thing today. It is not hard to follow, but I have still added quite a few translator's notes along the way; they reflect my own understanding and I cannot guarantee they are all correct, though they should be roughly right. My translation comes first, followed by the original.

Translation:
The whole post is just wrong, and it is measuring something completely different from what the author thinks and claims to be measuring.

First off, you have to know that spinlocks should only be used when you are not being scheduled away while holding them. (Translator's note: the point is that a thread holding a spinlock should keep running on its CPU and must not be scheduled off it by the scheduler; "scheduled" here means "scheduled away from the CPU".) But the blog author appears to be implementing his own spinlocks in user space, with no regard for whether the lock user (Translator's note: "lock user" here can be understood as the thread holding the lock) gets scheduled away or not. And the code behind the claimed "time during which nobody holds the lock" is complete garbage.

It records the time just before releasing the lock, records it again after acquiring the lock, and claims that the difference between the two is the time during which nobody held the lock. Which is simply inane, pointless, and completely wrong.
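
Translator's note: to make the criticism concrete, here is a minimal sketch of the kind of user-space spinlock benchmark being described. This is my own reconstruction, not the blog author's actual code; the names and structure are assumptions.

    // Translator's illustration only: a reconstruction of the criticized pattern.
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    using clock_type = std::chrono::steady_clock;

    std::atomic<bool> locked{false};        // the hand-rolled "spinlock"
    clock_type::time_point release_stamp;   // written while holding the lock
    long long worst_gap_ns = 0;             // also only touched under the lock

    void spin_lock()   { while (locked.exchange(true, std::memory_order_acquire)) { } }
    void spin_unlock() { locked.store(false, std::memory_order_release); }

    void worker(int iterations) {
        for (int i = 0; i < iterations; ++i) {
            spin_lock();
            auto now = clock_type::now();
            // Claimed to be "time during which nobody held the lock".  In reality it
            // also contains however long the previous owner spent off the CPU between
            // stamping release_stamp and actually releasing the lock.
            long long gap = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                now - release_stamp).count();
            if (gap > worst_gap_ns) worst_gap_ns = gap;

            release_stamp = clock_type::now();   // time stamped BEFORE the release
            // <-- the scheduler is free to preempt right here, with the lock still held
            spin_unlock();
        }
    }

    int main() {
        release_stamp = clock_type::now();
        std::vector<std::thread> threads;
        for (int t = 0; t < 8; ++t) threads.emplace_back(worker, 100000);
        for (auto& th : threads) th.join();
        std::printf("worst claimed 'unheld' gap: %lld ns\n", worst_gap_ns);
    }

The key point is the window between stamping release_stamp and actually clearing the lock: nothing stops the scheduler from preempting the thread right there.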

That's pure garbage. What happens is this:

a. since you're spinning, you're using up CPU time
b. at some random point, the scheduler takes the CPU away from you (Translator's note: i.e. you get scheduled out)
c. that random point may be right after you recorded the "current time", but before you actually released the spinlock

So you still hold the lock, but you are no longer on the CPU, because your time slice was used up. The "current time" you just recorded is now stale: it is not the moment at which the lock actually gets released.

At this point somebody else comes along wanting that "spinlock", and since the lock has not been released, they will now spin and wait for quite a while - the spinlock is still held by the previous thread; it was never released, its owner simply isn't running on a CPU right now. At some point in the future the scheduler runs the lock-holding thread again, and only then is the spinlock released. The waiting thread then gets the lock, records the current time, compares it with the earlier timestamp, and says: "oh, this lock went unheld for a really long time." (Translator's note: in reality the lock was held the entire time; it is just that its owner never got back onto a CPU after recording that timestamp.)

And note that the scenario above is still the good one. If you have more threads than CPUs (perhaps because of other, unrelated processes), then the next thread to be scheduled may not be the one that is going to release the lock. The next thread scheduled might be yet another thread that wants the lock, while the thread that actually owns it still isn't running!

So the code in question is pure garbage. You can't do spinlocks like that. Or rather, you very much can do them like that, and when you do, you are measuring random latencies and getting meaningless values, because what you are really measuring is "I have a lot of busywork, all the processes are CPU-bound, and I'm sampling random points of how long the scheduler kept a process in place".

And then you write a blog post blaming other people, without understanding that it is your own buggy code that is garbage and is producing random garbage values.

And then you test different schedulers and get different random values that you find interesting, because you think they reveal something cool about the schedulers.

But they don't. You are just getting random values because different schedulers have different heuristics for "do I let CPU-bound processes run with long time slices or not". Especially under a load where every thread is spinning inside this silly, buggy benchmark, they all look like pure throughput benchmarks that are not actually waiting on each other.

You might even see things like "when I run this as a foreground UI process I get different numbers than when I run it in the background as a batch process". Cool, interesting numbers, right?

No, they are not cool or interesting at all; you have just built a particularly bad random number generator.

So what's the fix?

Use a lock where you tell the system that you are waiting for the lock, and where the releasing thread tells the system when the lock is released, so that the scheduler can actually work with you instead of (randomly) working against you.

Notice how, when the author uses a real std::mutex, things just work fairly well, no matter which scheduler is used. Because now you are doing what you are supposed to do. Yes, the timing values may still be off - bad luck is bad luck - but at least now the scheduler knows that you are waiting on a lock.
Translator's note: these two paragraphs roughly mean: use the system's locks, so that waiters go to sleep and the scheduler knows exactly who is waiting for the lock, instead of everybody burning CPU by spinning; the timing values may still be somewhat inaccurate, because recording the time and actually releasing the lock still cannot be done atomically.
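
Translator's note: for comparison, here is a minimal sketch (again my own illustration, not code from either post) of the same measurement done with std::mutex, where blocked waiters sleep and the scheduler knows who is waiting on the lock.

    // Translator's illustration only.
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    using clock_type = std::chrono::steady_clock;

    std::mutex mtx;                        // a real sleeping lock the scheduler knows about
    clock_type::time_point release_stamp;  // only touched while holding the lock
    long long worst_gap_ns = 0;

    void worker(int iterations) {
        for (int i = 0; i < iterations; ++i) {
            std::lock_guard<std::mutex> guard(mtx);   // blocked waiters sleep instead of spinning
            auto now = clock_type::now();
            long long gap = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                now - release_stamp).count();
            if (gap > worst_gap_ns) worst_gap_ns = gap;
            release_stamp = clock_type::now();
        }   // the mutex is released here; the stamp can still go slightly stale,
            // but nobody burns a CPU while waiting for the lock
    }

    int main() {
        release_stamp = clock_type::now();
        std::vector<std::thread> threads;
        for (int t = 0; t < 8; ++t) threads.emplace_back(worker, 100000);
        for (auto& th : threads) th.join();
        std::printf("worst observed gap: %lld ns\n", worst_gap_ns);
    }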

Or, if you really do want to use spinlocks (hint: you don't), make sure that while you hold the lock you cannot be scheduled off the CPU. You need a realtime scheduler for that (or you need to be the kernel: inside the kernel spinlocks are fine, because the kernel itself can say "hey, I'm holding a spinlock, you can't schedule me right now" (Translator's note: i.e. you can't take my CPU away, you can't schedule me off the CPU)).
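
Translator's note: for reference only, moving a thread into a realtime scheduling class on Linux looks roughly like the sketch below (my own illustration, using the standard POSIX call; whether you should do this is exactly what the next paragraph warns about).

    // Translator's illustration only.
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    // Move the calling thread to the SCHED_FIFO realtime class (usually requires
    // root or a suitable RLIMIT_RTPRIO / CAP_SYS_NICE).  A SCHED_FIFO thread is
    // not time-sliced: it keeps the CPU until it blocks, yields, or is preempted
    // by a higher-priority realtime thread -- which is also what makes the
    // priority-inversion problems mentioned below so deadly.
    bool make_realtime(int priority) {
        sched_param param{};
        param.sched_priority = priority;   // e.g. somewhere in [1, 99] on Linux
        int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
        if (err != 0) {
            std::fprintf(stderr, "pthread_setschedparam failed: %d\n", err);
            return false;
        }
        return true;
    }

    int main() {
        if (make_realtime(10)) std::printf("now running under SCHED_FIFO\n");
    }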

But if you use a realtime scheduler, you need to be aware of its other implications. There are many, and some of them are deadly. I would strongly suggest not even trying. You are likely to get plenty of other things wrong anyway, and now some of those mistakes (such as unfairness or priority inversion) can hang your whole program outright, and you go from "slow because my locking is bad" to "not working at all, because I didn't think through a lot of other things".

Note that even OS kernels can run into this problem - imagine what happens in a virtualized environment where the hypervisor hands out physical CPUs to virtual CPUs in an overcommitted way. Yes, exactly: don't do that. Or at least be aware of it, and use virtualization-aware, paravirtualized spinlocks, so that you can tell the hypervisor "hey, don't schedule me away right now, I'm in a critical region".
Translator's note: this paragraph roughly says that in a virtualized environment, once physical CPUs are overcommitted to virtual CPUs, the virtual CPU running a spinning or lock-holding thread can itself be descheduled by the hypervisor, because more virtual CPUs want to run than there are physical CPUs - which recreates the same "the lock holder isn't actually running" problem one level down.

Because otherwise, at some point you will be scheduled away while holding the lock - perhaps right after you have finished all the work and are just about to release it - and then every thread that wants the lock is blocked on it for as long as you are off the CPU, all of them spinning on their CPUs and making no progress whatsoever.
Translator's note: this is really just saying that a thread holding a spinlock must not be scheduled off the CPU. Apart from using a realtime scheduler, the only way to guarantee that is to use spinlocks inside the kernel, because in user space you cannot stop the scheduler from taking the CPU away from the lock holder. When that happens, nobody makes progress and CPU time is simply wasted: every waiter is spinning while the thread that holds the lock isn't running at all.

Really, it is that simple.

This has absolutely nothing to do with cache coherence latencies. It has everything to do with badly implemented locking.

I repeat: do not use spinlocks in user space, unless you actually know what you are doing. And be aware that the likelihood that you know what you are doing is basically nil.

There is a very real reason why you need to use sleeping locks (like pthread_mutex and friends).

In fact, I'll go even further: never make up your own locking routines. Whether they are spinlocks or not, you will get them wrong. You will get the memory ordering wrong, or you will get the fairness wrong, or you will run into problems like the "busy-looping while the lock owner has been scheduled off the CPU" issue above.
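
Translator's note: as one concrete example of "you will get the memory ordering wrong", here is a sketch of my own showing a hand-rolled spinlock with relaxed ordering next to the acquire/release version that is actually required; the relaxed one looks plausible and is subtly broken, which is exactly the kind of trap being described.

    // Translator's illustration only.
    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<bool> locked{false};
    long shared_counter = 0;   // data the lock is supposed to protect

    // Broken: relaxed ordering does not order the critical section against the
    // lock operations, so accesses to shared_counter can leak outside the lock.
    void bad_lock()    { while (locked.exchange(true, std::memory_order_relaxed)) { } }
    void bad_unlock()  { locked.store(false, std::memory_order_relaxed); }

    // The minimum actually required: acquire on lock, release on unlock.  Even
    // this "correct" version still has every scheduling problem described above.
    void spin_lock()   { while (locked.exchange(true, std::memory_order_acquire)) { } }
    void spin_unlock() { locked.store(false, std::memory_order_release); }

    int main() {
        auto work = [] {
            for (int i = 0; i < 1000000; ++i) {
                spin_lock();
                ++shared_counter;
                spin_unlock();
            }
        };
        std::thread a(work), b(work);
        a.join(); b.join();
        std::printf("counter = %ld\n", shared_counter);
    }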

And no, sprinkling random "sched_yield()" calls into the spin loop does not really help either. It can easily lead to scheduling storms, with everybody yielding to all the wrong processes.

Sadly, even the system's own locking is not necessarily wonderful. For a lot of benchmarks, for example, you want unfair locking, because it can improve throughput enormously. But that can cause bad latencies. And the standard system locking (e.g. pthread_mutex_lock()) has no flag that says "I care about fair locking because latency is more important to me than throughput".
Translator's note: the point here is roughly that system locking is not perfect either; in some situations it will not behave the way a given user wants, because it has to pick one compromise instead of adapting to each workload - hence the missing flag for preferring latency over throughput.

So even if you use the locks technically correctly and avoid the outright bugs, you may still get the wrong kind of locking behavior for your load. Throughput and latency really do have very antagonistic tendencies where locking is concerned. An unfair lock that keeps the lock with a single thread (or keeps it on a single CPU) gives better cache-locality behavior and better throughput numbers.

But an unfair lock that favours the local thread and the local CPU core can directly cause latency spikes the moment some other core wants the lock, even though keeping the lock core-local helps cache behavior. A fair lock, by contrast, avoids the latency spikes, but causes a lot of cross-CPU cache-coherency traffic, because the locked region now migrates much more aggressively from one CPU to another.
Translator's note: these paragraphs keep making the same point: because of how modern CPU caches work, an unfair lock is very efficient locally (on the current CPU core) but can inflict high latencies on everybody else, whereas a fair lock keeps latencies under control across the whole system at the cost of lower overall throughput.

In general, unfair locking can get so bad latency-wise that it becomes completely unacceptable on larger systems. On smaller systems the unfairness may not be as noticeable, while the performance advantage is, so a system vendor will pick the unfair but faster lock-queueing algorithm.

(Pretty much every time we picked an unfair but fast locking model in the kernel, we ended up regretting it eventually and had to add fairness.)

So you may want to look not at the standard library implementation, but at specific locking implementations for your particular needs. Which, admittedly, is very, very annoying. But don't write your own. Find one that somebody else has written and has spent decades actually tuning and making work.

Because you should never, ever think that you are clever enough to write your own locking routines. Chances are you are not (and by "you" I very much include myself - we have tweaked all the in-kernel locking over decades, gone from simple test-and-set to ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).

There is a reason you can find decades' worth of academic papers on locking. Really. It is hard.

Linus

Original:
The whole post seems to be just wrong, and is measuring something completely different than what the author thinks and claims it is measuring.

First off, spinlocks can only be used if you actually know you’re not being scheduled while using them. But the blog post author seems to be implementing his own spinlocks in user space with no regard for whether the lock user might be scheduled or not. And the code used for the claimed “lock not held” timing is complete garbage.

It basically reads the time before releasing the lock, and then it reads it after acquiring the lock again, and claims that the time difference is the time when no lock was held. Which is just inane and pointless and completely wrong.

That’s pure garbage. What happens is that

(a) since you’re spinning, you’re using CPU time

(b) at a random time, the scheduler will schedule you out

(c) that random time might be just after you read the “current time”, but before you actually released the spinlock.

So now you still hold the lock, but you got scheduled away from the CPU, because you had used up your time slice. The “current time” you read is basically now stale, and has nothing to do with the (future) time when you are actually going to release the lock.

Somebody else comes in and wants that “spinlock”, and that somebody will now spin for a long while, since nobody is releasing it - it’s still held by that other thread entirely that was just scheduled out. At some point, the scheduler says “ok, now you’ve used your time slice”, and schedules the original thread, and now the lock is actually released. Then another thread comes in, gets the lock again, and then it looks at the time and says “oh, a long time passed without the lock being held at all”.

And notice how the above is the good scenario. If you have more threads than CPU’s (maybe because of other processes unrelated to your own test load), maybe the next thread that gets scheduled isn’t the one that is going to release the lock. No, that one already got its timeslice, so the next thread scheduled might be another thread that wants that lock that is still being held by the thread that isn’t even running right now!

So the code in question is pure garbage. You can’t do spinlocks like that. Or rather, you very much can do them like that, and when you do that you are measuring random latencies and getting nonsensical values, because what you are measuring is “I have a lot of busywork, where all the processes are CPU-bound, and I’m measuring random points of how long the scheduler kept the process in place”.

And then you write a blog-post blaming others, not understanding that it’s your incorrect code that is garbage, and is giving random garbage values.

And then you test different schedulers, and you get different random values that you think are interesting, because you think they show something cool about the schedulers.

But no. You’re just getting random values because different schedulers have different heuristics for “do I want to let CPU bound processes use long time slices or not”? Particularly in a load where everybody is just spinning on the silly and buggy benchmark, so they all look like they are pure throughput benchmarks and aren’t actually waiting on each other.

You might even see issues like “when I run this as a foreground UI process, I get different numbers than when I run it in the background as a batch process”. Cool interesting numbers, aren’t they?

No, they aren’t cool and interesting at all, you’ve just created a particularly bad random number generator.

So what’s the fix for this?

Use a lock where you tell the system that you’re waiting for the lock, and where the unlocking thread will let you know when it’s done, so that the scheduler can actually work with you, instead of (randomly) working against you.

Notice, how when the author uses an actual std::mutex, things just work fairly well, and regardless of scheduler. Because now you’re doing what you’re supposed to do. Yeah, the timing values might still be off - bad luck is bad luck - but at least now the scheduler is aware that you’re “spinning” on a lock.

Or, if you really want to use spinlocks (hint: you don’t), make sure that while you hold the lock, you’re not getting scheduled away. You need to use a realtime scheduler for that (or be the kernel: inside the kernel spinlocks are fine, because the kernel itself can say “hey, I’m doing a spinlock, you can’t schedule me right now”).

But if you use a realtime scheduler, you need to be aware of the other implications of that. There are many, and some of them are deadly. I would suggest strongly against trying. You’ll likely get all the other issues wrong anyway, and now some of the mistakes (like unfairness or priority inversions) can literally hang your whole thing entirely and things go from “slow because I did bad locking” to “not working at all, because I didn’t think through a lot of other things”.

Note that even OS kernels can have this issue - imagine what happens in virtualized environments with overcommitted physical CPU’s scheduled by a hypervisor as virtual CPU’s? Yeah - exactly. Don’t do that. Or at least be aware of it, and have some virtualization-aware paravirtualized spinlock so that you can tell the hypervisor that “hey, don’t do that to me right now, I’m in a critical region”.

Because otherwise you’re going to at some time be scheduled away while you’re holding the lock (perhaps after you’ve done all the work, and you’re just about to release it), and everybody else will be blocking on your incorrect locking while you’re scheduled away and not making any progress. All spinning on CPU’s.

Really, it’s that simple.

This has absolutely nothing to do with cache coherence latencies or anything like that. It has everything to do with badly implemented locking.

I repeat: do not use spinlocks in user space, unless you actually know what you’re doing. And be aware that the likelihood that you know what you are doing is basically nil.

There’s a very real reason why you need to use sleeping locks (like pthread_mutex etc).

In fact, I’d go even further: don’t ever make up your own locking routines. You will get them wrong, whether they are spinlocks or not. You’ll get memory ordering wrong, or you’ll get fairness wrong, or you’ll get issues like the above “busy-looping while somebody else has been scheduled out”.

And no, adding random “sched_yield()” calls while you’re spinning on the spinlock will not really help. It will easily result in scheduling storms while people are yielding to all the wrong processes.

Sadly, even the system locking isn’t necessarily wonderful. For a lot of benchmarks, for example, you want unfair locking, because it can improve throughput enormously. But that can cause bad latencies. And your standard system locking (eg pthread_mutex_lock()) may not have a flag to say “I care about fair locking because latency is more important than throughput”.

So even if you get locking technically right and are avoiding the outright bugs, you may get the wrong kind of lock behavior for your load. Throughput and latency really do tend to have very antagonistic tendencies wrt locking. An unfair lock that keeps the lock with one single thread (or keeps it to one single CPU) can give much better cache locality behavior, and much better throughput numbers.

But that unfair lock that prefers local threads and cores might thus directly result in latency spikes when some other core would really want to get the lock, but keeping it core-local helps cache behavior. In contrast, a fair lock avoids the latency spikes, but will cause a lot of cross-CPU cache coherency, because now the locked region will be much more aggressively moving from one CPU to another.

In general, unfair locking can get so bad latency-wise that it ends up being entirely unacceptable for larger systems. But for smaller systems the unfairness might not be as noticeable, but the performance advantage is noticeable, so then the system vendor will pick that unfair but faster lock queueing algorithm.

(Pretty much every time we picked an unfair - but fast - locking model in the kernel, we ended up regretting it eventually, and had to add fairness).

So you might want to look into not the standard library implementation, but specific locking implementations for your particular needs. Which is admittedly very very annoying indeed. But don’t write your own. Find somebody else that wrote one, and spent the decades actually tuning it and making it work.

Because you should never ever think that you’re clever enough to write your own locking routines… Because the likelihood is that you aren’t (and by that “you” I very much include myself - we’ve tweaked all the in-kernel locking over decades, and gone through the simple test-and-set to ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).

There’s a reason why you can find decades of academic papers on locking. Really. It’s hard.

Linus
