Item 48: Use caution when making streams parallel

Among mainstream languages, Java has always been at the forefront of providing facilities to ease the task of concurrent programming. When Java was released in 1996, it had built-in support for threads, with synchronization and wait/notify. Java 5 introduced the java.util.concurrent library, with concurrent collections and the executor framework. Java 7 introduced the fork-join package, a high-performance framework for parallel decomposition. Java 8 introduced streams, which can be parallelized with a single call to the parallel method. Writing concurrent programs in Java keeps getting easier, but writing concurrent programs that are correct and fast is as difficult as it ever was. Safety and liveness violations are a fact of life in concurrent programming, and parallel stream pipelines are no exception.

Consider this program from Item 45:

// Stream-based program to generate the first 20 Mersenne primes
public static void main(String[] args) {
    primes().map(p -> TWO.pow(p.intValueExact()).subtract(ONE))
            .filter(mersenne -> mersenne.isProbablePrime(50))
            .limit(20)
            .forEach(System.out::println);
}

static Stream<BigInteger> primes() {
    return Stream.iterate(TWO, BigInteger::nextProbablePrime);
}

On my machine, this program immediately starts printing primes and takes 12.5 seconds to run to completion. Suppose I naively try to speed it up by adding a call to parallel() to the stream pipeline. What do you think will happen to its performance? Will it get a few percent faster? A few percent slower? Sadly, what happens is that it doesn’t print anything, but CPU usage spikes to 90 percent and stays there indefinitely (a liveness failure). The program might terminate eventually, but I was unwilling to find out; I stopped it forcibly after half an hour.

What’s going on here? Simply put, the streams library has no idea how to parallelize this pipeline and the heuristics fail. Even under the best of circumstances, parallelizing a pipeline is unlikely to increase its performance if the source is from Stream.iterate, or the intermediate operation limit is used. This pipeline has to contend with both of these issues. Worse, the default parallelization strategy deals with the unpredictability of limit by assuming there’s no harm in processing a few extra elements and discarding any unneeded results. In this case, it takes roughly twice as long to find each Mersenne prime as it did to find the previous one. Thus, the cost of computing a single extra element is roughly equal to the cost of computing all previous elements combined, and this innocuous-looking pipeline brings the automatic parallelization algorithm to its knees. The moral of this story is simple: Do not parallelize stream pipelines indiscriminately. The performance consequences may be disastrous.

As a rule, performance gains from parallelism are best on streams over ArrayList, HashMap, HashSet, and ConcurrentHashMap instances; arrays; int ranges; and long ranges. What these data structures have in common is that they can all be accurately and cheaply split into subranges of any desired sizes, which makes it easy to divide work among parallel threads. The abstraction used by the streams library to perform this task is the spliterator, which is returned by the spliterator method on Stream and Iterable.

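As a minimal sketch of why such sources parallelize well (the class and method names here are illustrative, not from the book): a LongStream range splits exactly in half at each step, so the work divides evenly among worker threads, and the parallel result must equal the sequential one:

```java
import java.util.stream.LongStream;

public class SplittableSource {
    // A numeric range is an ideal parallel source: its spliterator splits
    // precisely and cheaply into equal subranges.
    static long sumOfSquares(long n) {
        return LongStream.rangeClosed(1, n)
                .parallel()
                .map(x -> x * x)
                .sum();
    }
}
```

Because sum() is an associative reduction over an exactly splittable range, the parallel pipeline produces the same value as its sequential counterpart.
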
Another important factor that all of these data structures have in common is that they provide good-to-excellent locality of reference when processed sequentially: sequential element references are stored together in memory. The objects referred to by those references may not be close to one another in memory, which reduces locality-of-reference. Locality-of-reference turns out to be critically important for parallelizing bulk operations: without it, threads spend much of their time idle, waiting for data to be transferred from memory into the processor’s cache. The data structures with the best locality of reference are primitive arrays because the data itself is stored contiguously in memory.

The nature of a stream pipeline’s terminal operation also affects the effectiveness of parallel execution. If a significant amount of work is done in the terminal operation compared to the overall work of the pipeline and that operation is inherently sequential, then parallelizing the pipeline will have limited effectiveness. The best terminal operations for parallelism are reductions, where all of the elements emerging from the pipeline are combined using one of Stream’s reduce methods, or prepackaged reductions such as min, max, count, and sum. The short-circuiting operations anyMatch, allMatch, and noneMatch are also amenable to parallelism. The operations performed by Stream’s collect method, which are known as mutable reductions, are not good candidates for parallelism because the overhead of combining collections is costly.

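For example (a hedged sketch; the helper names are mine), both a prepackaged reduction and an explicit reduce with an associative, stateless combiner split cleanly across threads:

```java
import java.util.stream.LongStream;

public class ParallelReductions {
    // Prepackaged reduction: sum() combines per-thread partial sums.
    static long total(long n) {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }

    // The same reduction spelled out: Long::sum is associative and
    // stateless, so partial results may be combined in any grouping.
    static long totalByReduce(long n) {
        return LongStream.rangeClosed(1, n).parallel().reduce(0L, Long::sum);
    }
}
```
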
If you write your own Stream, Iterable, or Collection implementation and you want decent parallel performance, you must override the spliterator method and test the parallel performance of the resulting streams extensively. Writing high-quality spliterators is difficult and beyond the scope of this book.

Not only can parallelizing a stream lead to poor performance, including liveness failures; it can lead to incorrect results and unpredictable behavior (safety failures). Safety failures may result from parallelizing a pipeline that uses mappers, filters, and other programmer-supplied function objects that fail to adhere to their specifications. The Stream specification places stringent requirements on these function objects. For example, the accumulator and combiner functions passed to Stream’s reduce operation must be associative, non-interfering, and stateless. If you violate these requirements (some of which are discussed in Item 46) but run your pipeline sequentially, it will likely yield correct results; if you parallelize it, it will likely fail, perhaps catastrophically. Along these lines, it’s worth noting that even if the parallelized Mersenne primes program had run to completion, it would not have printed the primes in the correct (ascending) order. To preserve the order displayed by the sequential version, you’d have to replace the forEach terminal operation with forEachOrdered, which is guaranteed to traverse parallel streams in encounter order.

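To make the ordering point concrete (a small sketch with illustrative names), forEachOrdered restores encounter order on a parallel stream, whereas forEach makes no such promise:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class OrderedTraversal {
    // forEachOrdered performs the actions one at a time in encounter
    // order, with a happens-before edge between successive actions, so
    // appending to an unsynchronized list is safe and deterministic here.
    static List<Integer> inOrder(int n) {
        List<Integer> out = new ArrayList<>();
        IntStream.rangeClosed(1, n).parallel().boxed().forEachOrdered(out::add);
        return out;
    }
}
```
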
Even assuming that you’re using an efficiently splittable source stream, a parallelizable or cheap terminal operation, and non-interfering function objects, you won’t get a good speedup from parallelization unless the pipeline is doing enough real work to offset the costs associated with parallelism. As a very rough estimate, the number of elements in the stream times the number of lines of code executed per element should be at least a hundred thousand [Lea14].

It’s important to remember that parallelizing a stream is strictly a performance optimization. As is the case for any optimization, you must test the performance before and after the change to ensure that it is worth doing (Item 67). Ideally, you should perform the test in a realistic system setting. Normally, all parallel stream pipelines in a program run in a common fork-join pool. A single misbehaving pipeline can harm the performance of others in unrelated parts of the system.

If it sounds like the odds are stacked against you when parallelizing stream pipelines, it’s because they are. An acquaintance who maintains a multimillion-line codebase that makes heavy use of streams found only a handful of places where parallel streams were effective. This does not mean that you should refrain from parallelizing streams. Under the right circumstances, it is possible to achieve near-linear speedup in the number of processor cores simply by adding a parallel() call to a stream pipeline. Certain domains, such as machine learning and data processing, are particularly amenable to these speedups.

As a simple example of a stream pipeline where parallelism is effective, consider this function for computing π(n), the number of primes less than or equal to n:

// Prime-counting stream pipeline - benefits from parallelization
static long pi(long n) {
    return LongStream.rangeClosed(2, n)
            .mapToObj(BigInteger::valueOf)
            .filter(i -> i.isProbablePrime(50))
            .count();
}

On my machine, it takes 31 seconds to compute π(10⁸) using this function. Simply adding a parallel() call reduces the time to 9.2 seconds:

// Prime-counting stream pipeline - parallel version
static long pi(long n) {
    return LongStream.rangeClosed(2, n)
            .parallel()
            .mapToObj(BigInteger::valueOf)
            .filter(i -> i.isProbablePrime(50))
            .count();
}

In other words, parallelizing the computation speeds it up by a factor of 3.7 on my quad-core machine. It’s worth noting that this is not how you’d compute π(n) for large values of n in practice. There are far more efficient algorithms, notably Lehmer’s formula.

If you are going to parallelize a stream of random numbers, start with a SplittableRandom instance rather than a ThreadLocalRandom (or the essentially obsolete Random). SplittableRandom is designed for precisely this use, and has the potential for linear speedup. ThreadLocalRandom is designed for use by a single thread, and will adapt itself to function as a parallel stream source, but won’t be as fast as SplittableRandom. Random synchronizes on every operation, so it will result in excessive, parallelism-killing contention.

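As a sketch (the helper method is hypothetical), SplittableRandom exposes ints, longs, and doubles stream methods whose spliterators split the generator itself rather than sharing it, so parallel execution involves no contention:

```java
import java.util.SplittableRandom;

public class ParallelRandoms {
    // Count how many of n pseudorandom longs are even. When the stream
    // is processed in parallel, the generator is split per subtask, so
    // no lock or shared mutable state is touched.
    static long evenCount(long n, long seed) {
        return new SplittableRandom(seed)
                .longs(n)
                .parallel()
                .filter(v -> (v & 1) == 0)
                .count();
    }
}
```

Note that the exact values drawn can depend on how the stream happens to be split, but every element is still uniformly distributed.
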
In summary, do not even attempt to parallelize a stream pipeline unless you have good reason to believe that it will preserve the correctness of the computation and increase its speed. The cost of inappropriately parallelizing a stream can be a program failure or performance disaster. If you believe that parallelism may be justified, ensure that your code remains correct when run in parallel, and do careful performance measurements under realistic conditions. If your code remains correct and these experiments bear out your suspicion of increased performance, then and only then parallelize the stream in production code.

