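How Strings Are Stored in Java

1. Introduction

During the National Day holiday I came across an interview question that was being hotly debated on Maimai:

How are strings stored in Java?

The question looks simple, but several deeper mechanisms sit behind it. This article walks through the relevant technical details one by one.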

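2. The String Class

Strings are used throughout Java programs. In Java a string is an object, and the String class is provided to create and manipulate strings.

2.1 java.lang.String

The member variables of java.lang.String are as follows: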

public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence {
    /** The value is used for character storage. */
    private final char value[];

    /** Cache the hash code for the string */
    private int hash; // Default to 0

    /** use serialVersionUID from JDK 1.0.2 for interoperability */
    private static final long serialVersionUID = -6849794470754667710L;

    // ... constructors and methods omitted
}

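The String class has two private instance fields: value, a final char array that stores the characters, and hash, a non-final int that caches the hash code once it has been computed. This listing matches JDK 8 and earlier; since JDK 9 (JEP 254, compact strings) value is a byte[] accompanied by a coder field that records whether the bytes are Latin-1 or UTF-16.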
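
To make the role of the hash field more concrete, the sketch below recomputes the formula documented for String.hashCode(), s[0]*31^(n-1) + ... + s[n-1], by hand. The names HashSketch and manualHash are made up for this example; the real computation lives inside String, which stores the result in hash so that later calls can skip the loop.

```java
// Hypothetical helper, not JDK code: recomputes the documented hashCode formula.
class HashSketch {
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // int overflow wraps around, as in String.hashCode()
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "hello";
        System.out.println(manualHash(s) == s.hashCode()); // true
    }
}
```

Because a String is immutable, the cached value never goes stale, which is why a plain int field without synchronization is sufficient here.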

2.2 char

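At run time a Java char is one UTF-16 code unit and always occupies two bytes. Characters outside the Basic Multilingual Plane (rare ideographs, for example) do not fit in a single char; they are stored as a surrogate pair, that is, two chars and four bytes.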
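
The difference between a char and a character can be checked with a few standard library calls. This is a small sketch; the class name CharSketch and the helper utf16Units are invented for the example, and the two code points were chosen only as one sample inside and one outside the Basic Multilingual Plane.

```java
// Hypothetical demo class: counts how many chars (UTF-16 code units) a code point needs.
class CharSketch {
    static int utf16Units(int codePoint) {
        return new String(Character.toChars(codePoint)).length();
    }

    public static void main(String[] args) {
        String supplementary = new String(Character.toChars(0x20000)); // U+20000, outside the BMP

        System.out.println(Character.BYTES);            // 2: size of one char in bytes
        System.out.println(utf16Units(0x4E2D));         // 1: U+4E2D fits in one char
        System.out.println(utf16Units(0x20000));        // 2: stored as a surrogate pair
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0)));      // true
    }
}
```

In other words, String.length() counts chars, not characters, so the two numbers differ as soon as a supplementary character appears.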

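2.3 Arrays

Inside the JVM, the main array-related classes are:
ArrayKlass
arrayOopDesc
They hold the class metadata and the instance data respectively. ArrayKlass is a subclass of Klass and arrayOopDesc is a subclass of oopDesc, which means that an array in Java is itself an object.

Each of the two has its own subclasses: TypeArrayKlass and typeArrayOopDesc describe arrays of primitive types, while ObjArrayKlass and objArrayOopDesc describe arrays of objects.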
TypeArrayKlass
typeArrayOopDesc
ObjArrayKlass
objArrayOopDesc

class ArrayKlass: public Klass {
  friend class VMStructs;
 private:
  // If you add a new field that points to any metaspace object, you
  // must add this field to ArrayKlass::metaspace_pointers_do().
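  int      _dimension;         // This is an n-dimensional array
  Klass* volatile _higher_dimension;  // Refers to the (n+1)-dimensional array, if present
  Klass* volatile _lower_dimension;   // Refers to the (n-1)-dimensional array, if present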
}
// arrayOopDesc is mainly responsible for maintaining the following:
//
//  the object header
//  the pointer to the Klass
//  the array length
class arrayOopDesc : public oopDesc {

}
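
That arrays are objects with their own class metadata is also visible from the Java side. The sketch below uses only standard reflection calls; the class name ArraySketch and the helper describe are invented for the example.

```java
// Hypothetical demo class: inspects the runtime class of primitive and object arrays.
class ArraySketch {
    static String describe(Object array) {
        Class<?> k = array.getClass();
        return k.getName() + " isArray=" + k.isArray()
                + " component=" + k.getComponentType().getName();
    }

    public static void main(String[] args) {
        int[] primitives = new int[3];
        String[] references = new String[3];
        Object o = primitives; // an array can be assigned to Object

        System.out.println(describe(o));          // [I isArray=true component=int
        System.out.println(describe(references)); // [Ljava.lang.String; isArray=true component=java.lang.String
        System.out.println(new int[2][2].getClass().getName()); // [[I
    }
}
```

The primitive array and the reference array report different runtime classes, which corresponds to the split between TypeArrayKlass and ObjArrayKlass described above.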

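Covering the points above is just about enough to answer the interview question. Because strings are among the most heavily used objects at run time, however, the JVM optimizes them extensively, chiefly through the String.intern() method and G1 string deduplication.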

3. String.intern

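Java provides the String.intern() method to deal with redundant string objects. For strings created at run time the developer has to call it explicitly; the method stores the string object in a hash table called StringTable and returns the canonical instance. The implementation is as follows: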

java.lang.String

// A native method; the call reaches the C code through JNI
public native String intern();

String.c

#include "jvm.h"
#include "java_lang_String.h"

JNIEXPORT jobject JNICALL
Java_java_lang_String_intern(JNIEnv *env, jobject this)
{
    // Delegates to JVM_InternString
    return JVM_InternString(env, this);
}

JNIEXPORT jboolean JNICALL
Java_java_lang_StringUTF16_isBigEndian(JNIEnv *env, jclass cls)
{
  unsigned int endianTest = 0xff000000;
  if (((char*)(&endianTest))[0] != 0) {
    return JNI_TRUE;
  } else {
    return JNI_FALSE;
  }
}

jvm.cpp


// String support ///////////////////////////////////////////////////////////////////////////

JVM_ENTRY(jstring, JVM_InternString(JNIEnv *env, jstring str))
  JVMWrapper("JVM_InternString");
  JvmtiVMObjectAllocEventCollector oam;
  if (str == NULL) return NULL;
  oop string = JNIHandles::resolve_non_null(str);
  // Delegates to StringTable::intern
  oop result = StringTable::intern(string, CHECK_NULL);
  return (jstring) JNIHandles::make_local(env, result);
JVM_END

stringTable.cpp

// Hash table that caches shared strings
static CompactHashtable<
  const jchar*, oop,
  read_string_from_compact_hashtable,
  java_lang_String::equals
> _shared_table;

oop StringTable::intern(Handle string_or_null_h, const jchar* name, int len, TRAPS) {
  // Compute the hash code of the string
  unsigned int hash = java_lang_String::hash_code(name, len);
  // Look the string up in the hash table by hash code, char array, and length
  oop found_string = StringTable::the_table()->lookup_shared(name, len, hash);
  if (found_string != NULL) {
    // Already present in the hash table: return it directly
    return found_string;
  }
  if (StringTable::_alt_hash) {
    hash = hash_string(name, len, true);
  }

  // Not present in the hash table: call do_intern to add the string object to it
  return StringTable::the_table()->do_intern(string_or_null_h, name, len,
                                             hash, CHECK_NULL);
}

oop StringTable::do_intern(Handle string_or_null_h, const jchar* name,
                           int len, uintx hash, TRAPS) {
  HandleMark hm(THREAD);  // cleanup strings created
  Handle string_h;

  if (!string_or_null_h.is_null()) {
    string_h = string_or_null_h;
  } else {
    string_h = java_lang_String::create_from_unicode(name, len, CHECK_NULL);
  }

  // Deduplicate the string before it is interned. Note that we should never
  // deduplicate a string after it has been interned. Doing so will counteract
  // compiler optimizations done on e.g. interned string literals.
  Universe::heap()->deduplicate_string(string_h());

  assert(java_lang_String::equals(string_h(), name, len),
         "string must be properly initialized");
  assert(len == java_lang_String::length(string_h()), "Must be same length");

  StringTableLookupOop lookup(THREAD, hash, string_h);
  StringTableGet stg(THREAD);

  bool rehash_warning;
  do {
    if (_local_table->get(THREAD, lookup, stg, &rehash_warning)) {
      update_needs_rehash(rehash_warning);
      return stg.get_res_oop();
    }
    WeakHandle<vm_string_table_data> wh = WeakHandle<vm_string_table_data>::create(string_h);
    // The hash table takes ownership of the WeakHandle, even if it's not inserted.
    if (_local_table->insert(THREAD, lookup, wh, &rehash_warning)) {
      update_needs_rehash(rehash_warning);
      return wh.resolve();
    }
  } while(true);
}
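
Seen from Java, the lookup-then-insert logic above has a simple observable effect, and its shape can be imitated with a concurrent map. The sketch below is only a rough model under stated assumptions: InternSketch, TABLE, and internSketch are invented names, and the map holds strong references, whereas the listing above wraps entries in WeakHandle so that unused strings can still be collected.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of an intern table; not how HotSpot implements StringTable.
class InternSketch {
    private static final ConcurrentHashMap<String, String> TABLE = new ConcurrentHashMap<>();

    // Return the instance already stored for this content, or store and return s.
    static String internSketch(String s) {
        String existing = TABLE.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public static void main(String[] args) {
        String literal = "jvm";          // string literals are interned by the JVM
        String copy = new String("jvm"); // a distinct object with the same content

        System.out.println(copy == literal);          // false
        System.out.println(copy.intern() == literal); // true

        String a = new String("table");
        String b = new String("table");
        System.out.println(internSketch(a) == internSketch(b)); // true
    }
}
```

putIfAbsent plays the part of the get/insert retry loop in do_intern: whichever thread inserts first wins, and every other caller receives that same instance.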

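4. G1 String Deduplication

To reduce memory usage, the JVM can optimize String objects automatically: when the char[] arrays of several String objects hold the same content, the JVM repoints them in the background at one shared array. G1 selects the candidates during young GC and during the marking phase of full GC. The feature was introduced by JEP 192 and is switched on with -XX:+UseStringDeduplication.

String deduplication differs from String.intern() in two ways:

  • String.intern() has to be called explicitly, whereas string deduplication is performed automatically by the JVM
  • String.intern() shares the String object itself, whereas string deduplication shares only the char[]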
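
The second difference is easy to lose track of, so here is a toy model of it. Everything in the sketch is hypothetical (DedupSketch, FakeString, and the map-based table are invented for illustration); the real work is done inside the JVM by GC threads, which can repoint the final value field of a genuine String in a way ordinary Java code cannot.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of deduplication: objects stay distinct, backing arrays are shared.
class DedupSketch {
    // Stand-in for java.lang.String: its own identity plus a reference to a backing array.
    static final class FakeString {
        char[] value;
        FakeString(String s) { this.value = s.toCharArray(); }
    }

    // Stand-in for the deduplication table: content -> canonical backing array.
    private final Map<String, char[]> table = new HashMap<>();

    // Repoint s.value at the canonical array with the same content, if one is known.
    void deduplicate(FakeString s) {
        char[] canonical = table.putIfAbsent(new String(s.value), s.value);
        if (canonical != null) {
            s.value = canonical;
        }
    }

    public static void main(String[] args) {
        FakeString a = new FakeString("dup");
        FakeString b = new FakeString("dup");
        DedupSketch dedup = new DedupSketch();
        dedup.deduplicate(a);
        dedup.deduplicate(b);

        System.out.println(a == b);             // false: two objects remain
        System.out.println(a.value == b.value); // true: one backing array
    }
}
```

With String.intern() the comparison a == b would become true for the returned references; with deduplication only the arrays behind the objects are merged.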

g1StringDedup.cpp

// During marking: decide whether the object is a deduplication candidate
bool G1StringDedup::is_candidate_from_mark(oop obj) {
  if (java_lang_String::is_instance_inlined(obj)) {
    bool from_young = G1CollectedHeap::heap()->heap_region_containing(obj)->is_young();
    // The source region is young and the object's age is below the threshold: return true
    if (from_young && obj->age() < StringDeduplicationAgeThreshold) {
      // Candidate found. String is being evacuated from young to old but has not
      // reached the deduplication age threshold, i.e. has not previously been a
      // candidate during its life in the young generation.
      return true;
    }
  }

  // Not a candidate
  return false;
}

// During evacuation: decide whether the object is a deduplication candidate
bool G1StringDedup::is_candidate_from_evacuation(bool from_young, bool to_young, oop obj) {
  if (from_young && java_lang_String::is_instance_inlined(obj)) {
    // The source region is young, the destination region is young, and the age equals the threshold
    if (to_young && obj->age() == StringDeduplicationAgeThreshold) {
      // Candidate found. String is being evacuated from young to young and just
      // reached the deduplication age threshold.
      return true;
    }
    // The source region is young, the destination region is old, and the age is below the threshold;
    // during a full GC every region is treated as old
    if (!to_young && obj->age() < StringDeduplicationAgeThreshold) {
      // Candidate found. String is being evacuated from young to old but has not
      // reached the deduplication age threshold, i.e. has not previously been a
      // candidate during its life in the young generation.
      return true;
    }
  }

  // Not a candidate
  return false;
}

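Once an object has been identified as a candidate, it is first written to a temporary queue, and a separate thread then performs the actual deduplication.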
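
Before the hand-off to the queue, the candidate decision made during evacuation boils down to a small pure function. The sketch below restates it in Java for readability only: CandidateSketch and isCandidateFromEvacuation are invented names, the check that the object is a String is left out, and the threshold value used in main is merely an assumed sample, passed in as a parameter.

```java
// Hypothetical restatement of the evacuation-time candidate rules; not HotSpot code.
class CandidateSketch {
    static boolean isCandidateFromEvacuation(boolean fromYoung, boolean toYoung,
                                             int age, int ageThreshold) {
        if (!fromYoung) {
            return false;               // only objects leaving a young region qualify
        }
        if (toYoung) {
            return age == ageThreshold; // young -> young: has just reached the threshold
        }
        return age < ageThreshold;      // young -> old: promoted before reaching it
    }

    public static void main(String[] args) {
        int threshold = 3; // assumed sample value for illustration
        System.out.println(isCandidateFromEvacuation(true, true, 3, threshold));   // true
        System.out.println(isCandidateFromEvacuation(true, true, 2, threshold));   // false
        System.out.println(isCandidateFromEvacuation(true, false, 1, threshold));  // true
        System.out.println(isCandidateFromEvacuation(true, false, 3, threshold));  // false
        System.out.println(isCandidateFromEvacuation(false, false, 1, threshold)); // false
    }
}
```

Taken together, the two branches aim to make each string a candidate at most once: either at the moment it reaches the threshold while still young, or when it is promoted before getting there.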

//
// Task for parallel unlink_or_oops_do() operation on the deduplication queue
// and table.
//
class G1StringDedupUnlinkOrOopsDoTask : public AbstractGangTask {
private:
  G1StringDedupUnlinkOrOopsDoClosure _cl;
  G1GCPhaseTimes* _phase_times;

public:
  G1StringDedupUnlinkOrOopsDoTask(BoolObjectClosure* is_alive,
                                  OopClosure* keep_alive,
                                  bool allow_resize_and_rehash,
                                  G1GCPhaseTimes* phase_times) :
    AbstractGangTask("G1StringDedupUnlinkOrOopsDoTask"),
    _cl(is_alive, keep_alive, allow_resize_and_rehash), _phase_times(phase_times) { }

  virtual void work(uint worker_id) {
    {
      G1GCParPhaseTimesTracker x(_phase_times, G1GCPhaseTimes::StringDedupQueueFixup, worker_id);
      StringDedupQueue::unlink_or_oops_do(&_cl);
    }
    {
      G1GCParPhaseTimesTracker x(_phase_times, G1GCPhaseTimes::StringDedupTableFixup, worker_id);
      StringDedupTable::unlink_or_oops_do(&_cl, worker_id);
    }
  }
};

5. References

OpenJDK 12 source code
JEP 192
