Item 46: Prefer side-effect-free functions in streams

If you’re new to streams, it can be difficult to get the hang of them. Merely expressing your computation as a stream pipeline can be hard. When you succeed, your program will run, but you may realize little if any benefit. Streams isn’t just an API, it’s a paradigm based on functional programming. In order to obtain the expressiveness, speed, and in some cases parallelizability that streams have to offer, you have to adopt the paradigm as well as the API.

The most important part of the streams paradigm is to structure your computation as a sequence of transformations where the result of each stage is as close as possible to a pure function of the result of the previous stage. A pure function is one whose result depends only on its input: it does not depend on any mutable state, nor does it update any state. In order to achieve this, any function objects that you pass into stream operations, both intermediate and terminal, should be free of side-effects.

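Translator's note: stream operations fall into the following categories:

1. Intermediate

  • A stream can be followed by zero or more intermediate operations. Each one takes a stream, applies some mapping or filtering to the data, and returns a new stream for the next operation to use. These operations are lazy: merely calling such a method does not start traversal of the stream. Common operations: map (mapToInt, flatMap, and so on), filter, distinct, sorted, peek, limit, skip, parallel, sequential, unordered

2. Terminal

  • A stream can have only one terminal operation. Once that operation has run, the stream is used up and cannot be operated on again, so it is necessarily the last operation on the stream. Executing the terminal operation is what actually starts traversal of the stream, and it produces a result or a side effect. Common operations: forEach, forEachOrdered, toArray, reduce, collect, min, max, count, anyMatch, allMatch, noneMatch, findFirst, findAny, iterator

  • When a stream goes through several intermediate operations, each of which transforms every element, is the total cost that of N separate for loops, where N is the number of transformations? It is not. The transformations are lazy: they are fused together when the terminal operation runs and completed in a single pass. A simplified mental model is that the stream holds a collection of operation functions, each intermediate operation adds its function to that collection, and the terminal operation loops over the stream's source once, applying all of the functions to each element.

3. Short-circuiting

  • An intermediate operation is short-circuiting if, given an infinite (unbounded) stream, it can return a finite stream.

  • A terminal operation is short-circuiting if, given an infinite stream, it can compute its result in finite time. When you operate on an infinite stream and want the operation to finish in finite time, having a short-circuiting operation in the pipeline is a necessary but not sufficient condition. Common operations: anyMatch, allMatch, noneMatch, findFirst, findAny, limit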
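
The laziness described in the note can be observed with a small trace. The sketch below is illustrative only: the class name LazyStreamDemo and the log list are assumptions, and the side effect inside peek exists purely to record when each stage runs, which is exactly the kind of thing production pipelines should avoid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyStreamDemo {
    // Returns a log of the order in which the pipeline stages actually ran.
    static List<String> trace() {
        List<String> log = new ArrayList<>();
        Stream<String> pipeline = Stream.of("a", "b", "c")
            .peek(s -> log.add("peek " + s))   // intermediate: nothing runs yet
            .map(String::toUpperCase);         // intermediate: still nothing runs
        log.add("pipeline built");
        List<String> result = pipeline
            .limit(2)                          // short-circuiting intermediate operation
            .collect(Collectors.toList());     // terminal: traversal starts here
        log.add("result " + result);
        return log;
    }

    public static void main(String[] args) {
        trace().forEach(System.out::println);
    }
}
```

The "pipeline built" entry is logged before any "peek" entry, because no element is touched until collect is called.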

Occasionally, you may see streams code that looks like this snippet, which builds a frequency table of the words in a text file:

// Uses the streams API but not the paradigm--Don't do this!
Map<String, Long> freq = new HashMap<>();
try (Stream<String> words = new Scanner(file).tokens()) {
    words.forEach(word -> {
        freq.merge(word.toLowerCase(), 1L, Long::sum);
    });
}

What’s wrong with this code? After all, it uses streams, lambdas, and method references, and gets the right answer. Simply put, it’s not streams code at all; it’s iterative code masquerading as streams code. It derives no benefits from the streams API, and it’s (a bit) longer, harder to read, and less maintainable than the corresponding iterative code. The problem stems from the fact that this code is doing all its work in a terminal forEach operation, using a lambda that mutates external state (the frequency table). A forEach operation that does anything more than present the result of the computation performed by a stream is a “bad smell in code,” as is a lambda that mutates state. So how should this code look?

// Proper use of streams to initialize a frequency table
Map<String, Long> freq;
try (Stream<String> words = new Scanner(file).tokens()) {
    freq = words.collect(groupingBy(String::toLowerCase, counting()));
}

This snippet does the same thing as the previous one but makes proper use of the streams API. It’s shorter and clearer. So why would anyone write it the other way? Because it uses tools they’re already familiar with. Java programmers know how to use for-each loops, and the forEach terminal operation is similar. But the forEach operation is among the least powerful of the terminal operations and the least stream-friendly. It’s explicitly iterative, and hence not amenable to parallelization. The forEach operation should be used only to report the result of a stream computation, not to perform the computation. Occasionally, it makes sense to use forEach for some other purpose, such as adding the results of a stream computation to a preexisting collection.
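
For readers who want to run the collector-based version, here is a self-contained sketch that counts the words of an in-memory string instead of a file. The class name FrequencyTable, the method count, and the sample sentence are illustrative assumptions; Scanner.tokens requires Java 9 or later.

```java
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

import java.util.Map;
import java.util.Scanner;
import java.util.stream.Stream;

class FrequencyTable {
    // Builds the table with a collector; no lambda mutates external state.
    static Map<String, Long> count(String text) {
        try (Stream<String> words = new Scanner(text).tokens()) {
            return words.collect(groupingBy(String::toLowerCase, counting()));
        }
    }

    public static void main(String[] args) {
        System.out.println(count("the cat saw The other cat"));
    }
}
```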

The improved code uses a collector, which is a new concept that you have to learn in order to use streams. The Collectors API is intimidating: it has thirty-nine methods, some of which have as many as five type parameters. The good news is that you can derive most of the benefit from this API without delving into its full complexity. For starters, you can ignore the Collector interface and think of a collector as an opaque object that encapsulates a reduction strategy. In this context, reduction means combining the elements of a stream into a single object. The object produced by a collector is typically a collection (which accounts for the name collector).

The collectors for gathering the elements of a stream into a true Collection are straightforward. There are three such collectors: toList(), toSet(), and toCollection(collectionFactory). They return, respectively, a list, a set, and a programmer-specified collection type. Armed with this knowledge, we can write a stream pipeline to extract a top-ten list from our frequency table.

// Pipeline to get a top-ten list of words from a frequency table
List<String> topTen = freq.keySet().stream()
    .sorted(comparing(freq::get).reversed())
    .limit(10)
    .collect(toList());

Note that we haven’t qualified the toList method with its class, Collectors. It is customary and wise to statically import all members of Collectors because it makes stream pipelines more readable.

The only tricky part of this code is the comparator that we pass to sorted, comparing(freq::get).reversed(). The comparing method is a comparator construction method (Item 14) that takes a key extraction function. The function takes a word, and the “extraction” is actually a table lookup: the bound method reference freq::get looks up the word in the frequency table and returns the number of times the word appears in the file. Finally, we call reversed on the comparator, so we’re sorting the words from most frequent to least frequent. Then it’s a simple matter to limit the stream to ten words and collect them into a list.
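
The same pipeline can be wrapped in a small method to try it out. This is a sketch under assumptions: the class name TopWords, the method top, and its parameter n are hypothetical, and the sample map uses Map.of (Java 9 or later) with distinct counts so the ordering is unambiguous.

```java
import static java.util.Comparator.comparing;
import static java.util.stream.Collectors.toList;

import java.util.List;
import java.util.Map;

class TopWords {
    // Returns the n most frequent words, most frequent first.
    static List<String> top(Map<String, Long> freq, int n) {
        return freq.keySet().stream()
            .sorted(comparing(freq::get).reversed())  // key extraction is a table lookup
            .limit(n)
            .collect(toList());
    }

    public static void main(String[] args) {
        System.out.println(top(Map.of("the", 3L, "cat", 2L, "saw", 1L), 2));
    }
}
```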

The previous code snippets use Scanner’s tokens method to get a stream over the scanner. This method was added in Java 9. If you’re using an earlier release, you can translate the scanner, which implements Iterator, into a stream using an adapter similar to the one in Item 47 (streamOf(Iterable<E>)).
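
One possible shape for such an adapter is sketched below. It is not the Item 47 code: the class name IteratorStreams and the method streamOfIterator are hypothetical, and the approach assumed here is the standard-library combination of Spliterators.spliteratorUnknownSize and StreamSupport.stream, both available since Java 8.

```java
import java.util.Iterator;
import java.util.Scanner;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class IteratorStreams {
    // Hypothetical helper: adapts any Iterator to a sequential, ordered Stream.
    static <E> Stream<E> streamOfIterator(Iterator<E> iterator) {
        return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED),
            false);  // false = sequential stream
    }

    public static void main(String[] args) {
        // Scanner implements Iterator<String>, so it can be passed directly.
        try (Scanner scanner = new Scanner("to be or not to be")) {
            System.out.println(streamOfIterator(scanner).count());
        }
    }
}
```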

So what about the other thirty-six methods in Collectors? Most of them exist to let you collect streams into maps, which is far more complicated than collecting them into true collections. Each stream element is associated with a key and a value, and multiple stream elements can be associated with the same key.

The simplest map collector is toMap(keyMapper, valueMapper), which takes two functions, one of which maps a stream element to a key, the other, to a value. We used this collector in our fromString implementation in Item 34 to make a map from the string form of an enum to the enum itself:

// Using a toMap collector to make a map from string to enum
private static final Map<String, Operation> stringToEnum =
    Stream.of(values()).collect(toMap(Object::toString, e -> e));

This simple form of toMap is perfect if each element in the stream maps to a unique key. If multiple stream elements map to the same key, the pipeline will terminate with an IllegalStateException.

The more complicated forms of toMap, as well as the groupingBy method, give you various ways to provide strategies for dealing with such collisions. One way is to provide the toMap method with a merge function in addition to its key and value mappers. The merge function is a BinaryOperator<V>, where V is the value type of the map. Any additional values associated with a key are combined with the existing value using the merge function, so, for example, if the merge function is multiplication, you end up with a value that is the product of all the values associated with the key by the value mapper.
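
The difference between the two-argument and three-argument forms can be seen in a short sketch. The class name ToMapMerge, the method names, and the sample words are illustrative assumptions; the merge function used here is addition rather than multiplication.

```java
import static java.util.stream.Collectors.toMap;

import java.util.Arrays;
import java.util.List;
import java.util.Map;

class ToMapMerge {
    // Two-argument toMap: throws IllegalStateException if two words share a first letter.
    static Map<Character, Integer> strict(List<String> words) {
        return words.stream().collect(toMap(w -> w.charAt(0), String::length));
    }

    // Three-argument toMap: lengths of words sharing a first letter are added together.
    static Map<Character, Integer> summed(List<String> words) {
        return words.stream()
            .collect(toMap(w -> w.charAt(0), String::length, Integer::sum));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "avocado", "banana");
        System.out.println(summed(words));
        try {
            strict(words);
        } catch (IllegalStateException e) {
            System.out.println("duplicate key rejected");
        }
    }
}
```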

The three-argument form of toMap is also useful to make a map from a key to a chosen element associated with that key. For example, suppose we have a stream of record albums by various artists, and we want a map from recording artist to best-selling album. This collector will do the job.

// Collector to generate a map from key to chosen element for key
Map<Artist, Album> topHits = albums.collect(
    toMap(Album::artist, a -> a, maxBy(comparing(Album::sales))));

Note that the merge function is produced by the static factory method maxBy, which is statically imported from BinaryOperator. This method converts a Comparator<T> into a BinaryOperator<T> that computes the maximum implied by the specified comparator. In this case, the comparator is returned by the comparator construction method comparing, which takes the key extractor function Album::sales. This may seem a bit convoluted, but the code reads nicely. Loosely speaking, it says, “convert the stream of albums to a map, mapping each artist to the album that is best by sales.” This is surprisingly close to the problem statement.
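
To make the collector runnable, the sketch below supplies a minimal Album type. That type is entirely hypothetical (the text does not show Album or Artist), and it uses a plain String for the artist to keep the example short.

```java
import static java.util.Comparator.comparing;
import static java.util.function.BinaryOperator.maxBy;
import static java.util.stream.Collectors.toMap;

import java.util.Map;
import java.util.stream.Stream;

class TopHits {
    // Hypothetical minimal Album type, defined only for this illustration.
    static final class Album {
        final String artist;
        final String title;
        final int sales;

        Album(String artist, String title, int sales) {
            this.artist = artist;
            this.title = title;
            this.sales = sales;
        }

        String artist() { return artist; }
        int sales() { return sales; }
    }

    // Maps each artist to that artist's best-selling album.
    static Map<String, Album> topHits(Stream<Album> albums) {
        return albums.collect(
            toMap(Album::artist, a -> a, maxBy(comparing(Album::sales))));
    }

    public static void main(String[] args) {
        Map<String, Album> hits = topHits(Stream.of(
            new Album("A", "first", 10),
            new Album("A", "second", 30),
            new Album("B", "only", 5)));
        hits.forEach((artist, album) -> System.out.println(artist + ": " + album.title));
    }
}
```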

Another use of the three-argument form of toMap is to produce a collector that imposes a last-write-wins policy when there are collisions. For many streams, the results will be nondeterministic, but if all the values that may be associated with a key by the mapping functions are identical, or if they are all acceptable, this collector’s behavior may be just what you want:

// Collector to impose last-write-wins policy
toMap(keyMapper, valueMapper, (v1, v2) -> v2)

The third and final version of toMap takes a fourth argument, which is a map factory, for use when you want to specify a particular map implementation such as an EnumMap or a TreeMap.
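
As a brief illustration of the four-argument form, the sketch below collects into a TreeMap so that the keys come back sorted. The class name SortedToMap, the method lengths, and the keep-first merge function are assumptions chosen for the example.

```java
import static java.util.stream.Collectors.toMap;

import java.util.TreeMap;
import java.util.stream.Stream;

class SortedToMap {
    // Maps each word to its length, stored in a TreeMap supplied by the map factory.
    static TreeMap<String, Integer> lengths(Stream<String> words) {
        return words.collect(
            toMap(w -> w, String::length, (first, second) -> first, TreeMap::new));
    }

    public static void main(String[] args) {
        System.out.println(lengths(Stream.of("pear", "fig", "apple")));
    }
}
```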

There are also variant forms of all three versions of toMap, named toConcurrentMap, that run efficiently in parallel and produce ConcurrentHashMap instances.

In addition to the toMap method, the Collectors API provides the groupingBy method, which returns collectors to produce maps that group elements into categories based on a classifier function. The classifier function takes an element and returns the category into which it falls. This category serves as the element’s map key. The simplest version of the groupingBy method takes only a classifier and returns a map whose values are lists of all the elements in each category. This is the collector that we used in the Anagram program in Item 45 to generate a map from alphabetized word to a list of the words sharing the alphabetization:

words.collect(groupingBy(word -> alphabetize(word)))

If you want groupingBy to return a collector that produces a map with values other than lists, you can specify a downstream collector in addition to a classifier. A downstream collector produces a value from a stream containing all the elements in a category. The simplest use of this parameter is to pass toSet(), which results in a map whose values are sets of elements rather than lists.

Alternatively, you can pass toCollection(collectionFactory), which lets you create the collections into which each category of elements is placed. This gives you the flexibility to choose any collection type you want. Another simple use of the two-argument form of groupingBy is to pass counting() as the downstream collector. This results in a map that associates each category with the number of elements in the category, rather than a collection containing the elements. That’s what you saw in the frequency table example at the beginning of this item:

Map<String, Long> freq = words.collect(groupingBy(String::toLowerCase, counting()));

The third version of groupingBy lets you specify a map factory in addition to a downstream collector. Note that this method violates the standard telescoping argument list pattern: the mapFactory parameter precedes, rather than follows, the downstream parameter. This version of groupingBy gives you control over the containing map as well as the contained collections, so, for example, you can specify a collector that returns a TreeMap whose values are TreeSets.
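
The three overloadings of groupingBy can be compared side by side. In this sketch the class name GroupingVariants, the method names, and the choice of word length as the classifier are illustrative assumptions.

```java
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.toCollection;
import static java.util.stream.Collectors.toSet;

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class GroupingVariants {
    // Classifier only: map values are lists.
    static Map<Integer, List<String>> byLength(List<String> words) {
        return words.stream().collect(groupingBy(String::length));
    }

    // Classifier plus downstream collector: map values are sets.
    static Map<Integer, Set<String>> byLengthAsSets(List<String> words) {
        return words.stream().collect(groupingBy(String::length, toSet()));
    }

    // Classifier, map factory, then downstream collector: a TreeMap of TreeSets.
    static TreeMap<Integer, TreeSet<String>> sortedByLength(List<String> words) {
        return words.stream().collect(
            groupingBy(String::length, TreeMap::new, toCollection(TreeSet::new)));
    }

    public static void main(String[] args) {
        System.out.println(sortedByLength(Arrays.asList("bb", "a", "cc", "bb")));
    }
}
```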

The groupingByConcurrent method provides variants of all three overloadings of groupingBy. These variants run efficiently in parallel and produce ConcurrentHashMap instances. There is also a rarely used relative of groupingBy called partitioningBy. In lieu of a classifier method, it takes a predicate and returns a map whose key is a Boolean. There are two overloadings of this method, one of which takes a downstream collector in addition to a predicate.
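
Both overloadings of partitioningBy are shown in the sketch below. The class name Partitioning, the method names, and the even/odd predicate are illustrative assumptions.

```java
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.partitioningBy;

import java.util.Arrays;
import java.util.List;
import java.util.Map;

class Partitioning {
    // Predicate only: values are lists of the matching and non-matching elements.
    static Map<Boolean, List<Integer>> evenOdd(List<Integer> numbers) {
        return numbers.stream().collect(partitioningBy(n -> n % 2 == 0));
    }

    // Predicate plus downstream collector: values are element counts.
    static Map<Boolean, Long> evenOddCounts(List<Integer> numbers) {
        return numbers.stream().collect(partitioningBy(n -> n % 2 == 0, counting()));
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(evenOdd(numbers));
        System.out.println(evenOddCounts(numbers));
    }
}
```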

The collectors returned by the counting method are intended only for use as downstream collectors. The same functionality is available directly on Stream, via the count method, so there is never a reason to say collect(counting()). There are fifteen more Collectors methods with this property. They include the nine methods whose names begin with summing, averaging, and summarizing (whose functionality is available on the corresponding primitive stream types). They also include all overloadings of the reducing method, and the filtering, mapping, flatMapping, and collectingAndThen methods. Most programmers can safely ignore the majority of these methods. From a design perspective, these collectors represent an attempt to partially duplicate the functionality of streams in collectors so that downstream collectors can act as “ministreams.”
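
Two of these downstream-only collectors, summingInt and mapping, are sketched below inside groupingBy. The class name DownstreamCollectors, the method names, and the sample words are illustrative assumptions.

```java
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.mapping;
import static java.util.stream.Collectors.summingInt;
import static java.util.stream.Collectors.toList;

import java.util.Arrays;
import java.util.List;
import java.util.Map;

class DownstreamCollectors {
    // summingInt as a downstream collector: total word length per first letter.
    static Map<Character, Integer> totalLengthByInitial(List<String> words) {
        return words.stream()
            .collect(groupingBy(w -> w.charAt(0), summingInt(String::length)));
    }

    // mapping as a downstream collector: each group holds lengths, not the words themselves.
    static Map<Character, List<Integer>> lengthsByInitial(List<String> words) {
        return words.stream()
            .collect(groupingBy(w -> w.charAt(0), mapping(String::length, toList())));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "avocado", "banana");
        System.out.println(totalLengthByInitial(words));
        System.out.println(lengthsByInitial(words));
    }
}
```

Outside a downstream position, the equivalent stream methods (for example mapToInt followed by sum) are usually the more direct choice.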

There are three Collectors methods we have yet to mention. Though they are in Collectors, they don’t involve collections. The first two are minBy and maxBy, which take a comparator and return the minimum or maximum element in the stream as determined by the comparator. They are minor generalizations of the min and max methods in the Stream interface and are the collector analogues of the binary operators returned by the like-named methods in BinaryOperator. Recall that we used BinaryOperator.maxBy in our best-selling album example.
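
The collector form of maxBy is mostly useful in a downstream position, as this sketch suggests. The class name LongestWords and the method names are illustrative assumptions; note that the collector yields an Optional, just as Stream.max does.

```java
import static java.util.Comparator.comparing;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.maxBy;

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class LongestWords {
    // Collectors.maxBy as a downstream collector: longest word per first letter.
    static Map<Character, Optional<String>> longestByInitial(List<String> words) {
        return words.stream()
            .collect(groupingBy(w -> w.charAt(0), maxBy(comparing(String::length))));
    }

    // At the top level, Stream.max does the same job more directly.
    static Optional<String> longest(List<String> words) {
        return words.stream().max(comparing(String::length));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "avocado", "banana");
        System.out.println(longestByInitial(words));
        System.out.println(longest(words));
    }
}
```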

The final Collectors method is joining, which operates only on streams of CharSequence instances such as strings. In its parameterless form, it returns a collector that simply concatenates the elements. Its one-argument form takes a single CharSequence parameter named delimiter and returns a collector that joins the stream elements, inserting the delimiter between adjacent elements. If you pass in a comma as the delimiter, the collector returns a comma-separated values string (but beware that the string will be ambiguous if any of the elements in the stream contain commas). The three-argument form takes a prefix and suffix in addition to the delimiter. The resulting collector generates strings like the ones that you get when you print a collection, for example [came, saw, conquered].
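
The three forms of joining look like this in use. The class name JoiningForms and the method names are illustrative assumptions.

```java
import static java.util.stream.Collectors.joining;

import java.util.Arrays;
import java.util.List;

class JoiningForms {
    // Parameterless form: plain concatenation.
    static String plain(List<String> words) {
        return words.stream().collect(joining());
    }

    // One-argument form: delimiter between adjacent elements.
    static String csv(List<String> words) {
        return words.stream().collect(joining(","));
    }

    // Three-argument form: delimiter, prefix, and suffix.
    static String bracketed(List<String> words) {
        return words.stream().collect(joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("came", "saw", "conquered");
        System.out.println(plain(words));      // prints camesawconquered
        System.out.println(csv(words));        // prints came,saw,conquered
        System.out.println(bracketed(words));  // prints [came, saw, conquered]
    }
}
```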

In summary, the essence of programming stream pipelines is side-effect-free function objects. This applies to all of the many function objects passed to streams and related objects. The terminal operation forEach should only be used to report the result of a computation performed by a stream, not to perform the computation. In order to use streams properly, you have to know about collectors. The most important collector factories are toList, toSet, toMap, groupingBy, and joining.

