String.intern in Java 7 and 8 (Part 3)

This article was translated by 文學敏 (ImportNew) from java-performance. For reposting requirements, see the end of the article.

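I want to return to the String.intern method, which I discussed earlier (part 1, part 2). Over the past few months I have used intern heavily in my pet project, mainly to study the pros and cons of calling String.intern on every non-temporary String object (non-temporary meaning an object that lives for more than a few seconds and will most likely be promoted to the GC old generation).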

As I mentioned before, the advantages of String.intern in Java 7 and 8 are:

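    It is very fast, with almost no performance penalty in multithreaded mode (even though a single global string pool is still used)
    It saves memory, which makes your data set smaller and (usually) makes your program run faster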

The main disadvantages of this method are (also mentioned before):

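    You must set the JVM parameter -XX:StringTableSize=N in advance, and the string pool uses that fixed size (resizing the JVM string pool requires restarting the JVM)
    You need to add String.intern calls in many places throughout the program (possibly via your own wrapper), which raises the maintenance cost of the code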
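
The "own wrapper" mentioned in the list above could look roughly like the following. This is only a minimal sketch: the class name InternUtil and the method pooled are hypothetical and do not come from the original article. Funnelling every call through one helper at least keeps the pooling decision in a single place.

```java
/** Hypothetical helper: the single place in the code base that calls String.intern(). */
final class InternUtil {
    private InternUtil() {}

    /** Returns the pooled instance of s, or null if s is null. */
    static String pooled(String s) {
        return s == null ? null : s.intern();
    }

    public static void main(String[] args) {
        // new String(...) forces two distinct heap objects with equal contents
        String a = new String("Queensland");
        String b = new String("Queensland");
        System.out.println(a == b);                 // false
        System.out.println(pooled(a) == pooled(b)); // true
    }
}
```

If interning later turns out to be too expensive for some field, only the call sites of this helper need to be revisited.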

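After several months of using String.intern in my project, I think it should be applied only to fields with a limited set of distinct values (such as people's given names or state/province names). We should not intern objects that are unlikely to be repeated; doing so just wastes CPU time.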

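For example, suppose you are writing a personal records management tool for a government (compared with social network sign-up data, many more of the fields will be non-empty).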

If you have to keep all of that data in memory, interning makes sense for:

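    Given names – even in a multicultural country such as Australia, the number of majority ethnic groups (those that make up most of the population) is small. This keeps the total number of names in use to a few thousand, with fewer than 1,000 in common use.
    Surnames – highly repetitive in China and less so in other countries, but the chance of repetition is still good enough.
    Apartment numbers – in most countries these may contain letters, but they are usually numbers counting up from 1, so there is only a limited set of them.
    Street names (with the street type, such as 'road'/'avenue'/'street', stripped) – there are few of them
    State/region/province – there are only a handful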
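
As an illustration of the list above, here is a minimal sketch of an in-memory record that interns only its low-cardinality fields. The class PersonRecord and all of its field names are hypothetical (the passportNo field in particular is an assumed example of a per-person unique value), not code from the original article.

```java
/** Hypothetical in-memory record; the field names are illustrative only. */
final class PersonRecord {
    final String givenName;  // a few thousand distinct values: interned
    final String surname;    // repeats often enough: interned
    final String state;      // a handful of values: interned
    final String passportNo; // assumed unique per person: NOT interned

    PersonRecord(String givenName, String surname, String state, String passportNo) {
        this.givenName = givenName.intern();
        this.surname = surname.intern();
        this.state = state.intern();
        this.passportNo = passportNo; // interning a unique value would only cost CPU
    }

    public static void main(String[] args) {
        // new String(...) simulates values freshly read from a file or database
        PersonRecord a = new PersonRecord(new String("Olivia"), new String("Smith"), new String("NSW"), "N1234567");
        PersonRecord b = new PersonRecord(new String("Olivia"), new String("Jones"), new String("NSW"), "N7654321");
        System.out.println(a.givenName == b.givenName); // true: one shared "Olivia"
        System.out.println(a.state == b.state);         // true: one shared "NSW"
        System.out.println(a.surname == b.surname);     // false: different values
    }
}
```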

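On the other hand, if you cannot split the data into small pieces, it is better not to use intern. For example, a full street address such as "100 King st" is far more likely to be unique than its separate parts "100" or "King".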
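
The splitting idea can be sketched as follows. The class AddressParts and its parse method are hypothetical, and the parser assumes a deliberately simplified "<number> <street name> <type>" layout that real addresses often do not follow; the point is only that the parts repeat across records while the full string usually does not.

```java
/** Hypothetical holder for the parts of a street address. */
final class AddressParts {
    final String number; // e.g. "100"
    final String street; // e.g. "King"
    final String type;   // e.g. "st"

    private AddressParts(String number, String street, String type) {
        this.number = number;
        this.street = street;
        this.type = type;
    }

    /** Assumes the simplified layout "<number> <street name> <type>". */
    static AddressParts parse(String fullAddress) {
        String s = fullAddress.trim();
        int first = s.indexOf(' ');
        int last = s.lastIndexOf(' ');
        if (first < 0 || first == last) {
            throw new IllegalArgumentException("expected '<number> <street> <type>': " + fullAddress);
        }
        // Each part comes from a small set of values, so it is interned separately
        return new AddressParts(
                s.substring(0, first).intern(),
                s.substring(first + 1, last).trim().intern(),
                s.substring(last + 1).intern());
    }

    public static void main(String[] args) {
        AddressParts a = parse("100 King st");
        AddressParts b = parse("7 King st");
        System.out.println(a.street == b.street); // true: "King" is pooled once
        System.out.println(a.type == b.type);     // true: "st" is pooled once
        System.out.println(a.number == b.number); // false: "100" vs "7"
    }
}
```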

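Let's compare adding plain strings and interned strings to a JDK HashMap. This shows, more or less, the extra cost of calling intern on unique strings. I ran the test on my workstation, which has an Intel Xeon E5-2650 CPU (8 cores/16 threads, 2 GHz) and 128 GB of RAM, with -Xmx and -Xms set to the same value to reduce the number of garbage collections.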

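import java.util.HashMap;
import java.util.Map;

// The methods below are meant to be placed inside a class of your choice.
private static void testInsertVsIntern()
{
    // warm-up runs, so that the JIT compiles both methods before the timed run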
    testMapInsertion( 100 * 1000 );
    testMapInsertionIntern( 100 * 1000 );
    System.gc();

    System.out.println( "Now real run" );

    testMapInsertion( 50 * 1000 * 1000 + 100 );
    System.gc();
    testMapInsertionIntern( 50 * 1000 * 1000 + 100 );
}

private static void testMapInsertion( final int cnt )
{
    final Map<Integer, String> map = new HashMap<Integer, String>( cnt );
    long start = System.currentTimeMillis();
    for ( int i = 0; i < cnt; ++i )
    {
        final String str = Integer.toString( i );
        map.put( i, str );
        if ( i % 1000000 == 0 ) //1M
        {
            System.out.println( i + "; time (insert) = " + ( System.currentTimeMillis() - start ) / 1000.0 + " sec" );
            start = System.currentTimeMillis();
        }
    }
    System.out.println( "Total length = " + map.size() );
}

private static void testMapInsertionIntern( final int cnt )
{
    final Map<Integer, String> map = new HashMap<Integer, String>( cnt );
    long start = System.currentTimeMillis();
    for ( int i = 0; i < cnt; ++i )
    {
        final String str = Integer.toString( i );
        map.put( i, str.intern() ); //here is the difference!
        if ( i % 1000000 == 0 ) //1M
        {
            System.out.println( i + "; time (intern) = " + ( System.currentTimeMillis() - start ) / 1000.0 + " sec" );
            start = System.currentTimeMillis();
        }
    }
    System.out.println( "Total length = " + map.size() );
}

As you can see, the only difference between the two test methods is that testMapInsertionIntern calls String.intern(). Everything else is identical.

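The first test simply adds Integer-to-String pairs to the map. It consistently took 0.065-0.07 seconds to add each batch of 1,000,000 pairs (this time includes the int-to-String conversion), which is a steady insertion rate of roughly 14 to 15 million pairs per second.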

For the second test I set the JVM string pool size with -XX:StringTableSize=1000003 and got the following results (there was only one minor GC during the test):

1000000; time (intern) = 0.231 sec
2000000; time (intern) = 0.251 sec
3000000; time (intern) = 0.268 sec
4000000; time (intern) = 0.285 sec
5000000; time (intern) = 0.311 sec
6000000; time (intern) = 0.333 sec
7000000; time (intern) = 0.369 sec
8000000; time (intern) = 0.399 sec
9000000; time (intern) = 0.444 sec
10000000; time (intern) = 0.507 sec
11000000; time (intern) = 0.532 sec
12000000; time (intern) = 0.614 sec
13000000; time (intern) = 0.686 sec
14000000; time (intern) = 0.797 sec
15000000; time (intern) = 0.837 sec
16000000; time (intern) = 0.902 sec
17000000; time (intern) = 0.962 sec
18000000; time (intern) = 1.019 sec
19000000; time (intern) = 1.083 sec
20000000; time (intern) = 1.121 sec
21000000; time (intern) = 1.204 sec
22000000; time (intern) = 1.226 sec
23000000; time (intern) = 1.292 sec
24000000; time (intern) = 1.312 sec
25000000; time (intern) = 1.379 sec
26000000; time (intern) = 1.444 sec
27000000; time (intern) = 1.491 sec
28000000; time (intern) = 1.542 sec
29000000; time (intern) = 1.569 sec
30000000; time (intern) = 1.732 sec
31000000; time (intern) = 1.74 sec
32000000; time (intern) = 1.735 sec
33000000; time (intern) = 1.842 sec
34000000; time (intern) = 1.893 sec
35000000; time (intern) = 1.989 sec
36000000; time (intern) = 1.971 sec
37000000; time (intern) = 2.033 sec
38000000; time (intern) = 2.139 sec
[GC 4195274K->4207538K(16078208K), 5.2907230 secs]
39000000; time (intern) = 7.46 sec
40000000; time (intern) = 2.259 sec
41000000; time (intern) = 2.28 sec
42000000; time (intern) = 2.346 sec
43000000; time (intern) = 2.394 sec
44000000; time (intern) = 2.414 sec
45000000; time (intern) = 2.492 sec
46000000; time (intern) = 2.536 sec
47000000; time (intern) = 2.619 sec
48000000; time (intern) = 2.654 sec
49000000; time (intern) = 2.673 sec
50000000; time (intern) = 2.775 sec

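As you can see, processing the first 1M strings takes about 3.5 times as long as without intern, and every subsequent batch takes longer still, because the fixed-size pool keeps filling up and its bucket chains keep growing. Going back to the names and addresses example, this means that interning full street addresses would cost at least 3.5 to 4 times the CPU time while bringing no benefit (most such addresses are unique).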
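
The arithmetic behind these figures can be reproduced with a few lines. This is a rough back-of-the-envelope sketch: the class and method names are hypothetical, the timings are the ones reported above, and the chain length assumes that the pool's entries are spread evenly over the -XX:StringTableSize buckets.

```java
import java.util.Locale;

/** Hypothetical helper that redoes the arithmetic on the timings reported in this article. */
final class InternCostEstimate {
    /** Ratio of the interned batch time to the plain batch time. */
    static double slowdown(double internSeconds, double plainSeconds) {
        return internSeconds / plainSeconds;
    }

    /** Average entries per bucket, assuming an even spread over a fixed-size table. */
    static double averageChainLength(long entries, long buckets) {
        return (double) entries / buckets;
    }

    public static void main(String[] args) {
        // First 1M interned strings (0.231 s) against the plain 0.065-0.07 s per 1M
        System.out.println(String.format(Locale.ROOT, "%.2f", slowdown(0.231, 0.065))); // 3.55
        System.out.println(String.format(Locale.ROOT, "%.2f", slowdown(0.231, 0.07)));  // 3.30
        // 50M pooled strings in a table of 1,000,003 buckets
        System.out.println(String.format(Locale.ROOT, "%.1f",
                averageChainLength(50_000_000L, 1_000_003L))); // 50.0
    }
}
```

With about 50 strings per bucket by the end of the run, each intern call has to walk a long chain, which is consistent with the batch times growing steadily in the log.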
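
Related articles

String.intern in Java 6, 7 and 8 – string pooling: describes how String.intern() is implemented in Java 7 and 8 and the benefits of using it.

String.intern in Java 6, 7 and 8 – multithreaded access: describes the performance characteristics of calling String.intern() from multiple threads.

Summary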

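Although String.intern() has been carefully optimized since Java 7, its cost is still significant (especially for CPU-bound programs). In the simple example in this article, the test that does not call String.intern() runs about 3.5 times faster. To be on the safe side, do not call String.intern() on every long-lived string. You can, however, use intern for fields with a limited set of values (such as state/province); in that case the memory saved offsets the initial CPU cost.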

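Reposted from: http://www.importnew.com/12681.html

Original article: java-performance. Translation: ImportNew.com - 文學敏
Translation link: http://www.importnew.com/12681.html
[ When reposting, please keep the original source, the translator, and the translation link. ]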
