Alluxio's Metadata Management Strategy Based on Hot/Cold Data Separation

Preface


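At the end of the previous article, I mentioned an approach called tiered metadata management. Its core idea is to treat metadata differently by tier, for example with two layers, Cache + Persist: the cache layer serves hot, frequently accessed data, while the persist (persistent) layer stores cold, rarely accessed data. The goal is strong metadata scalability together with good access performance. The storage system Alluxio uses exactly this tiered approach. In this article we take a brief look at Alluxio's tiered-layer metadata management.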

Alluxio's Internal Metadata Management Architecture


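Compared with loading all metadata into memory to speed up access, Alluxio makes an optimization: it caches only the active data, which is a defining feature of its internal metadata management. Cold data that has not been accessed recently is kept in a local RocksDB instance.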

Alluxio has dedicated names for these storage layers: the in-memory layer that caches active data is called the cache store, and the underlying RocksDB layer is called the backing store.

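Alluxio manages its metadata on top of these two store layers and serves data access from them.

[Figure: architecture diagram of the two-layer metadata store (image not included)]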

What I want to focus on in this article is how the cache store exchanges data with the RocksDB store (the backing store) described above.

Alluxio's Custom Cache Implementation with Asynchronous Write-Out


The cache store layer has to do the following two things to keep metadata properly updated:

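  • Promptly evict formerly hot data whose access frequency has dropped, and write it out to the backing store.
  • When an access arrives for data that is not cached, read it from the backing store and load it into the cache.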
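The two responsibilities above can be sketched with a toy two-tier store. This is only an illustration under simplifying assumptions, not Alluxio's code: the class TwoTierStoreSketch and its methods are hypothetical, a plain HashMap stands in for RocksDB, and eviction here is synchronous and LRU-ordered rather than asynchronous.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical two-tier store: a bounded in-memory cache in front of a backing map. */
class TwoTierStoreSketch {
  private final int mCapacity;
  // Access-ordered map: iteration starts at the least recently accessed entry.
  private final LinkedHashMap<Long, String> mCache = new LinkedHashMap<>(16, 0.75f, true);
  // Stand-in for the persistent layer (assumption: a HashMap instead of RocksDB).
  private final Map<Long, String> mBacking = new HashMap<>();

  TwoTierStoreSketch(int capacity) {
    mCapacity = capacity;
  }

  void put(long id, String value) {
    mCache.put(id, value);
    spillIfNeeded();
  }

  Optional<String> get(long id) {
    String cached = mCache.get(id);
    if (cached != null) {
      return Optional.of(cached);
    }
    // Cache miss: read from the backing layer and load the value into the cache.
    String stored = mBacking.get(id);
    if (stored == null) {
      return Optional.empty();
    }
    mCache.put(id, stored);
    spillIfNeeded();
    return Optional.of(stored);
  }

  // Moves the coldest entries to the backing layer until the cache fits its capacity.
  private void spillIfNeeded() {
    while (mCache.size() > mCapacity) {
      Map.Entry<Long, String> coldest = mCache.entrySet().iterator().next();
      Long key = coldest.getKey();
      String value = coldest.getValue();
      mBacking.put(key, value);
      mCache.remove(key);
    }
  }

  boolean isCached(long id) {
    return mCache.containsKey(id);
  }

  boolean isPersisted(long id) {
    return mBacking.containsKey(id);
  }
}
```

With a capacity of 2, inserting a third entry pushes the least recently accessed one into the backing map, and a later get for that entry brings it back into the cache.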

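Of these two points, the first is clearly the one Alluxio has to implement with care. So what approach does Alluxio take? Does it use a mature, existing cache library such as Guava Cache? Guava Cache's built-in expireAfterAccess fits this use case well.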

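In the end, however, Alluxio did not adopt Guava Cache. In my view, the main reason is that Guava Cache does not support asynchronous write-out of expired entries. Guava Cache does not start an extra thread to process expired entries; instead, it handles them in passing during subsequent cache access operations. This carries a hidden risk: if the cache has not been accessed for a long time, and the next access happens after most entries have passed their expiration time, that access may trigger a large amount of entry cleanup and reloading. Guava Cache itself then consumes a lot of CPU on this work, which inevitably degrades the cache's ability to serve requests. In addition, entry updates in Guava Cache take a lock, so a slow entry update blocks other threads that want to access the same entry.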

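The conclusion is that if we want cache entries to be removed and refreshed promptly, we can implement our own thread to trigger that maintenance. Below is the explanation from the Guava wiki on GitHub, which also covers why Guava Cache does not start an internal thread to handle cache expiration: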

When Does Cleanup Happen?
Caches built with CacheBuilder do not perform cleanup and evict values "automatically," or instantly after a value expires, or anything of the sort. Instead, it performs small amounts of maintenance during write operations, or during occasional read operations if writes are rare.

The reason for this is as follows: if we wanted to perform Cache maintenance continuously, we would need to create a thread, and its operations would be competing with user operations for shared locks. Additionally, some environments restrict the creation of threads, which would make CacheBuilder unusable in that environment.

Instead, we put the choice in your hands. If your cache is high-throughput, then you don't have to worry about performing cache maintenance to clean up expired entries and the like. If your cache does writes only rarely and you don't want cleanup to block cache reads, you may wish to create your own maintenance thread that calls Cache.cleanUp() at regular intervals.

If you want to schedule regular cache maintenance for a cache which only rarely has writes, just schedule the maintenance using ScheduledExecutorService.
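The advice in the quote can be sketched with the JDK alone. The example below is an assumption-laden stand-in, not Guava: ExpiringMapSketch is a hypothetical map that, like the behavior described above, does not remove expired entries by itself, and startMaintenance schedules a periodic cleanUp with ScheduledExecutorService. With Guava itself, the scheduled task would call Cache.cleanUp() as the quote suggests.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical expiring map whose stale entries are removed only when cleanUp() runs. */
class ExpiringMapSketch<K, V> {
  private static final class Timed<T> {
    final T mValue;
    final long mExpiresAtNanos;

    Timed(T value, long expiresAtNanos) {
      mValue = value;
      mExpiresAtNanos = expiresAtNanos;
    }
  }

  private final ConcurrentHashMap<K, Timed<V>> mMap = new ConcurrentHashMap<>();
  private final long mTtlNanos;

  ExpiringMapSketch(long ttlMillis) {
    mTtlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMillis);
  }

  void put(K key, V value) {
    mMap.put(key, new Timed<>(value, System.nanoTime() + mTtlNanos));
  }

  // Expired entries are hidden from readers but stay in the map until cleanUp() runs.
  V get(K key) {
    Timed<V> timed = mMap.get(key);
    return (timed == null || isExpired(timed, System.nanoTime())) ? null : timed.mValue;
  }

  // Removes every expired entry and returns how many were removed.
  int cleanUp() {
    long now = System.nanoTime();
    int removed = 0;
    for (Map.Entry<K, Timed<V>> e : mMap.entrySet()) {
      if (isExpired(e.getValue(), now) && mMap.remove(e.getKey(), e.getValue())) {
        removed++;
      }
    }
    return removed;
  }

  // Number of entries physically present, expired or not.
  int rawSize() {
    return mMap.size();
  }

  // Runs cleanUp() periodically on a daemon thread; the caller shuts the scheduler down.
  ScheduledExecutorService startMaintenance(long periodMillis) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "cache-maintenance-sketch");
      t.setDaemon(true);
      return t;
    });
    scheduler.scheduleWithFixedDelay(
        this::cleanUp, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    return scheduler;
  }

  private static boolean isExpired(Timed<?> timed, long nowNanos) {
    return nowNanos - timed.mExpiresAtNanos >= 0;
  }
}
```

The point of the design is that cleanup cost is paid on a background schedule instead of being charged to whichever read or write happens to arrive after a long idle period.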

OK, now let's look at Alluxio's own cache implementation, which writes out outdated entries asynchronously. We will walk through it alongside the code.

First, the definition of CachingInodeStore from the architecture diagram above:

public final class CachingInodeStore implements InodeStore, Closeable {
  private static final Logger LOG = LoggerFactory.getLogger(CachingInodeStore.class);
  // Backing store: the store that data is written out to for persistence
  private final InodeStore mBackingStore;
  private final InodeLockManager mLockManager;

  // Cache recently-accessed inodes.
  @VisibleForTesting
  final InodeCache mInodeCache;

  // Cache recently-accessed inode tree edges.
  @VisibleForTesting
  final EdgeCache mEdgeCache;

  @VisibleForTesting
  final ListingCache mListingCache;

  // Starts true, but becomes permanently false if we ever need to spill metadata to the backing
  // store. When true, we can optimize lookups for non-existent inodes because we don't need to
  // check the backing store. We can also optimize getChildren by skipping the range query on the
  // backing store.
  private volatile boolean mBackingStoreEmpty;
  ...
  
  public CachingInodeStore(InodeStore backingStore, InodeLockManager lockManager) {
    mBackingStore = backingStore;
    mLockManager = lockManager;
    AlluxioConfiguration conf = ServerConfiguration.global();
    int maxSize = conf.getInt(PropertyKey.MASTER_METASTORE_INODE_CACHE_MAX_SIZE);
    Preconditions.checkState(maxSize > 0,
        "Maximum cache size %s must be positive, but is set to %s",
        PropertyKey.MASTER_METASTORE_INODE_CACHE_MAX_SIZE.getName(), maxSize);
    float highWaterMarkRatio = ConfigurationUtils.checkRatio(conf,
        PropertyKey.MASTER_METASTORE_INODE_CACHE_HIGH_WATER_MARK_RATIO);
    // Compute the high water mark
    int highWaterMark = Math.round(maxSize * highWaterMarkRatio);
    float lowWaterMarkRatio = ConfigurationUtils.checkRatio(conf,
        PropertyKey.MASTER_METASTORE_INODE_CACHE_LOW_WATER_MARK_RATIO);
    Preconditions.checkState(lowWaterMarkRatio <= highWaterMarkRatio,
        "low water mark ratio (%s=%s) must not exceed high water mark ratio (%s=%s)",
        PropertyKey.MASTER_METASTORE_INODE_CACHE_LOW_WATER_MARK_RATIO.getName(), lowWaterMarkRatio,
        PropertyKey.MASTER_METASTORE_INODE_CACHE_HIGH_WATER_MARK_RATIO, highWaterMarkRatio);
    // Compute the low water mark
    int lowWaterMark = Math.round(maxSize * lowWaterMarkRatio);

    mBackingStoreEmpty = true;
    CacheConfiguration cacheConf = CacheConfiguration.newBuilder().setMaxSize(maxSize)
        .setHighWaterMark(highWaterMark).setLowWaterMark(lowWaterMark)
        .setEvictBatchSize(conf.getInt(PropertyKey.MASTER_METASTORE_INODE_CACHE_EVICT_BATCH_SIZE))
        .build();
    // Pass the cache configuration built above into each cache
    mInodeCache = new InodeCache(cacheConf);
    mEdgeCache = new EdgeCache(cacheConf);
    mListingCache = new ListingCache(cacheConf);
  }
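To make the water mark arithmetic in the constructor concrete, here is a small standalone sketch. WaterMarkSketch is a hypothetical helper, and the numbers used in the note that follows are illustrative values rather than Alluxio defaults.

```java
/** Hypothetical helper that mirrors the water mark computation shown above. */
class WaterMarkSketch {
  /** Returns {highWaterMark, lowWaterMark} for the given cache size and ratios. */
  static int[] waterMarks(int maxSize, float highRatio, float lowRatio) {
    if (maxSize <= 0) {
      throw new IllegalArgumentException("maxSize must be positive");
    }
    if (lowRatio > highRatio) {
      throw new IllegalArgumentException("low ratio must not exceed high ratio");
    }
    // Same rounding as the constructor: Math.round(float) returns an int.
    int highWaterMark = Math.round(maxSize * highRatio);
    int lowWaterMark = Math.round(maxSize * lowRatio);
    return new int[] {highWaterMark, lowWaterMark};
  }
}
```

For a cache of 1000 entries with ratios 0.85 and 0.8, eviction would start above 850 entries and stop once the cache is back down to 800.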

Here we focus on mInodeCache, the cache that holds recently accessed inodes.

  class InodeCache extends Cache<Long, MutableInode<?>> {
    public InodeCache(CacheConfiguration conf) {
      super(conf, "inode-cache", MetricKey.MASTER_INODE_CACHE_SIZE);
    }
    ...
}

We can see that InodeCache extends the class Cache<K, V>. Let's move on to the implementation of that class:

public abstract class Cache<K, V> implements Closeable {
  private static final Logger LOG = LoggerFactory.getLogger(Cache.class);

  private final int mMaxSize;
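  // High water mark: once the number of cache entries exceeds this value, entry write-out is triggered
  private final int mHighWaterMark;
  // Low water mark: the number of entries left in the cache after each write-out and cleanup
  private final int mLowWaterMark;
  // Batch size for each round of entry eviction
  private final int mEvictBatchSize;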
  private final String mName;
  // The cache map; ConcurrentHashMap is used for thread safety
  @VisibleForTesting
  final ConcurrentHashMap<K, Entry> mMap;
  // TODO(andrew): Support using multiple threads to speed up backing store writes.
  // Thread for performing eviction to the backing store.
  @VisibleForTesting
  // Thread that evicts entries and writes them out
  final EvictionThread mEvictionThread;
  ...

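In short, Alluxio's Cache class is essentially a ConcurrentHashMap plus an EvictionThread. ConcurrentHashMap is used because the map is accessed concurrently. Entries are then written out according to the thresholds defined here (the high and low water marks).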

Next, let's look directly at the logic of EvictionThread:

class EvictionThread extends Thread {
    @VisibleForTesting
    volatile boolean mIsSleeping = true;

    // Cache entries that are candidates for eviction
    private final List<Entry> mEvictionCandidates = new ArrayList<>(mEvictBatchSize);
    private final List<Entry> mDirtyEvictionCandidates = new ArrayList<>(mEvictBatchSize);
    private final Logger mCacheFullLogger = new SamplingLogger(LOG, 10L * Constants.SECOND_MS);

    ...

    @Override
    public void run() {
      while (!Thread.interrupted()) {
        // While the total number of map entries does not exceed the high water mark, the thread waits
        while (!overHighWaterMark()) {
          synchronized (mEvictionThread) {
            if (!overHighWaterMark()) {
              try {
                mIsSleeping = true;
                mEvictionThread.wait();
                mIsSleeping = false;
              } catch (InterruptedException e) {
                return;
              }
            }
          }
        }
        if (cacheIsFull()) {
          mCacheFullLogger.warn(
              "Metastore {} cache is full. Consider increasing the cache size or lowering the "
                  + "high water mark. size:{} lowWaterMark:{} highWaterMark:{} maxSize:{}",
              mName, mMap.size(), mLowWaterMark, mHighWaterMark, mMaxSize);
        }
        // The map is over the high water mark: write out and evict entries until it is down to the low water mark
        evictToLowWaterMark();
      }
    }
}
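The sleep-until-needed loop above follows the classic monitor wait/notify pattern. A stripped-down sketch of that pattern is shown here; WaterMarkEvictorSketch and all of its members are hypothetical, it evicts arbitrary entries, and it drops them instead of writing them to a backing store.

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical cache whose background thread sleeps until the high water mark is crossed. */
class WaterMarkEvictorSketch {
  private final ConcurrentHashMap<Integer, String> mMap = new ConcurrentHashMap<>();
  private final Object mMonitor = new Object();
  private final int mHighWaterMark;
  private final int mLowWaterMark;

  WaterMarkEvictorSketch(int highWaterMark, int lowWaterMark) {
    mHighWaterMark = highWaterMark;
    mLowWaterMark = lowWaterMark;
    Thread evictor = new Thread(this::runLoop, "evictor-sketch");
    evictor.setDaemon(true);
    evictor.start();
  }

  void put(int key, String value) {
    mMap.put(key, value);
    // Wake the evictor only when there is work for it to do.
    if (mMap.size() > mHighWaterMark) {
      synchronized (mMonitor) {
        mMonitor.notifyAll();
      }
    }
  }

  int size() {
    return mMap.size();
  }

  private void runLoop() {
    while (!Thread.interrupted()) {
      synchronized (mMonitor) {
        // Re-check the condition under the monitor so a wake-up cannot be missed.
        while (mMap.size() <= mHighWaterMark) {
          try {
            mMonitor.wait();
          } catch (InterruptedException e) {
            return;
          }
        }
      }
      // Over the high water mark: drop entries until the low water mark is reached.
      Iterator<Integer> keys = mMap.keySet().iterator();
      while (mMap.size() > mLowWaterMark && keys.hasNext()) {
        keys.next();
        keys.remove();
      }
    }
  }

  /** Spins until the map is at or below the high water mark, or the timeout elapses. */
  boolean awaitAtMostHighWaterMark(long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (mMap.size() > mHighWaterMark) {
      if (System.nanoTime() - deadline > 0) {
        return false;
      }
      Thread.yield();
    }
    return true;
  }
}
```

Note that the guarantee after writers stop is only that the map ends at or below the high water mark: a put that arrives after an eviction round leaves the size between the two marks without waking the thread again.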

Moving on to the evictToLowWaterMark method:

    private void evictToLowWaterMark() {
      long evictionStart = System.nanoTime();
      // Number of entries to evict in this round
      int toEvict = mMap.size() - mLowWaterMark;
      // Running count of entries evicted so far
      int evictionCount = 0;
      // Write out and evict entries
      while (evictionCount < toEvict) {
        if (!mEvictionHead.hasNext()) {
          mEvictionHead = mMap.values().iterator();
        }
        // Iterate over the map entries and collect the entries to evict
        fillBatch(toEvict - evictionCount);
        // Write out and evict the collected entries
        evictionCount += evictBatch();
      }
      if (evictionCount > 0) {
        LOG.debug("{}: Evicted {} entries in {}ms", mName, evictionCount,
            (System.nanoTime() - evictionStart) / Constants.MS_NANO);
      }
    }

The collection process in fillBatch looks like this:

    private void fillBatch(int count) {
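      // Cap on the number of entries collected for one batch
      int targetSize = Math.min(count, mEvictBatchSize);
      // While the candidate list is below the target size, keep scanning the map for unreferenced entries
      while (mEvictionCandidates.size() < targetSize && mEvictionHead.hasNext()) {
        Entry candidate = mEvictionHead.next();
        // If the entry has been referenced, reset its flag to false and skip it. If the scan reaches it
        // again and it has not been accessed in between, it will be collected for eviction.
        // Accessing an entry sets its referenced flag back to true.
        if (candidate.mReferenced) {
          candidate.mReferenced = false;
          continue;
        }
        // The entry is already marked as unreferenced, so add it to the eviction candidates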
        mEvictionCandidates.add(candidate);
        if (candidate.mDirty) {
          mDirtyEvictionCandidates.add(candidate);
        }
      }
    }
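The referenced-flag scan in fillBatch resembles the second-chance (CLOCK) replacement policy: an entry that was touched since the last pass gets its flag cleared and survives, and an entry found with the flag already cleared becomes a victim. The sketch below illustrates that idea in isolation. SecondChanceSketch is hypothetical, uses a list with a moving hand instead of a ConcurrentHashMap iterator, assumes unique keys, and removes victims immediately rather than batching them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical second-chance scan over a ring of entries with referenced flags. */
class SecondChanceSketch {
  private static final class Slot {
    final String mKey;
    boolean mReferenced = true; // assumption: new entries start out as referenced

    Slot(String key) {
      mKey = key;
    }
  }

  private final List<Slot> mRing = new ArrayList<>();
  private final Map<String, Slot> mIndex = new HashMap<>();
  private int mHand = 0;

  void add(String key) {
    Slot slot = new Slot(key);
    mRing.add(slot);
    mIndex.put(key, slot);
  }

  // Marks an entry as recently accessed.
  void touch(String key) {
    Slot slot = mIndex.get(key);
    if (slot != null) {
      slot.mReferenced = true;
    }
  }

  // Evicts up to count entries and returns their keys in eviction order.
  List<String> evict(int count) {
    List<String> victims = new ArrayList<>();
    // Two laps are enough: the first clears flags, the second finds victims.
    int budget = 2 * mRing.size();
    while (victims.size() < count && !mRing.isEmpty() && budget-- > 0) {
      if (mHand >= mRing.size()) {
        mHand = 0;
      }
      Slot slot = mRing.get(mHand);
      if (slot.mReferenced) {
        // Second chance: clear the flag and move on.
        slot.mReferenced = false;
        mHand++;
      } else {
        // Not touched since the last pass: evict. The hand now points at the next slot.
        victims.add(slot.mKey);
        mRing.remove(mHand);
        mIndex.remove(slot.mKey);
      }
    }
    return victims;
  }
}
```

If a, b, and c are added and one entry is evicted, the first lap clears all three flags and the second lap evicts a. Touching b before the next eviction then causes c to be evicted instead of b.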

Then comes the entry write-out operation:

    private int evictBatch() {
      int evicted = 0;
      if (mEvictionCandidates.isEmpty()) {
        return evicted;
      }
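      // Write out the entries, which fall into two classes:
      // if an entry's value matches what is saved in the backing store, it is simply removed from the map;
      // if an entry's value has changed relative to the backing store, it is flushed first and then removed from the map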
      flushEntries(mDirtyEvictionCandidates);
      for (Entry entry : mEvictionCandidates) {
        if (evictIfClean(entry)) {
          evicted++;
        }
      }
      mEvictionCandidates.clear();
      mDirtyEvictionCandidates.clear();
      return evicted;
    }

As we can see, evicted entries fall into two classes, depending on whether the entry's value matches the value persisted in the backing store.

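  • Class one: the entry only needs to be removed from the cache map.
  • Class two: the entry is removed from the cache map and also has to be written out to the backing store.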

This is determined by the dirty attribute of the cache Entry:

  protected class Entry {
    protected K mKey;
    // null value means that the key has been removed from the cache, but still needs to be removed
    // from the backing store.
    @Nullable
    protected V mValue;

    // Whether the entry is out of sync with the backing store. If mDirty is true, the entry must be
    // flushed to the backing store before it can be evicted.
    protected volatile boolean mDirty = true;
    ...

The flushEntries method called from evictBatch depends on how the subclass implements the write-out to the backing store.

  /**
   * Attempts to flush the given entries to the backing store.
   *
   * The subclass is responsible for setting each candidate's mDirty field to false on success.
   *
   * @param candidates the candidate entries to flush
   */
  protected abstract void flushEntries(List<Entry> candidates);
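As a rough illustration of the contract described in that Javadoc, the sketch below shows one way a flush step and the clean/dirty split could fit together. It is not the Alluxio subclass: DirtyFlushSketch, its Entry class, and the HashMap used as a backing store are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical cache showing flush-then-evict for dirty entries. */
class DirtyFlushSketch {
  static final class Entry {
    final String mKey;
    volatile String mValue;
    volatile boolean mDirty;

    Entry(String key, String value, boolean dirty) {
      mKey = key;
      mValue = value;
      mDirty = dirty;
    }
  }

  final ConcurrentHashMap<String, Entry> mCache = new ConcurrentHashMap<>();
  // Stand-in for the backing store (assumption: a plain map).
  final Map<String, String> mBackingStore = new HashMap<>();
  int mBackingWrites = 0;

  Entry cache(String key, String value, boolean dirty) {
    Entry entry = new Entry(key, value, dirty);
    mCache.put(key, entry);
    return entry;
  }

  // Writes each candidate to the backing store and marks it clean on success.
  void flushEntries(List<Entry> candidates) {
    for (Entry entry : candidates) {
      mBackingStore.put(entry.mKey, entry.mValue);
      mBackingWrites++;
      entry.mDirty = false;
    }
  }

  // Removes the entry from the cache only if it is still clean at removal time.
  boolean evictIfClean(Entry entry) {
    if (entry.mDirty) {
      return false;
    }
    return mCache.computeIfPresent(entry.mKey, (k, cur) -> cur.mDirty ? cur : null) == null;
  }

  // Flushes the dirty candidates first, then evicts every candidate that is clean.
  int evictBatch(List<Entry> candidates) {
    List<Entry> dirty = new ArrayList<>();
    for (Entry entry : candidates) {
      if (entry.mDirty) {
        dirty.add(entry);
      }
    }
    flushEntries(dirty);
    int evicted = 0;
    for (Entry entry : candidates) {
      if (evictIfClean(entry)) {
        evicted++;
      }
    }
    return evicted;
  }
}
```

Only dirty entries cost a backing store write; a clean entry is already persisted, so dropping it from the map loses nothing.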

That covers the asynchronous write-out of outdated map entries. Now let's look at the other part: the entry access operations get, put, and delete.

We take the put operation as an example:

  /**
   * Writes a key/value pair to the cache.
   *
   * @param key the key
   * @param value the value
   */
  public void put(K key, V value) {
    mMap.compute(key, (k, entry) -> {
      // Callback hook for the put operation
      onPut(key, value);
      // If the cache is already full, write directly to the backing store
      if (entry == null && cacheIsFull()) {
        writeToBackingStore(key, value);
        return null;
      }
      if (entry == null || entry.mValue == null) {
        onCacheUpdate(key, value);
        return new Entry(key, value);
      }
      // Update the entry
      entry.mValue = value;
      // Mark the entry as referenced, meaning it was accessed recently; get and remove also set this flag to true
      entry.mReferenced = true;
      // Mark the entry as dirty, meaning its value has changed since it was loaded from the backing store
      entry.mDirty = true;
      return entry;
    });
    // Then notify the eviction thread so it can decide whether entries need to be evicted; get and remove also call this method at the end
    wakeEvictionThreadIfNecessary();
  }

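The last line of the method above wakes the eviction thread promptly whenever eviction is needed, so the burst of entry write-outs and removals in a short window described earlier does not build up. This differs from Guava Cache's expiration handling.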
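The cache-full branch of put is worth isolating: a brand-new key that arrives while the cache is full bypasses the cache and goes straight to the backing store, while an existing key is still updated in place. A minimal sketch of just that branch, assuming a hypothetical BypassPutSketch class with a map as the backing store and no eviction thread:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical cache whose put bypasses the cache when it is full. */
class BypassPutSketch {
  private final int mMaxSize;
  final ConcurrentHashMap<String, String> mCache = new ConcurrentHashMap<>();
  // Stand-in for the backing store (assumption: a concurrent map).
  final Map<String, String> mBackingStore = new ConcurrentHashMap<>();

  BypassPutSketch(int maxSize) {
    mMaxSize = maxSize;
  }

  void put(String key, String value) {
    // compute runs atomically for the key, so the check and the update cannot interleave.
    mCache.compute(key, (k, existing) -> {
      if (existing == null && mCache.size() >= mMaxSize) {
        // New key and no room: write through to the backing store, cache nothing.
        mBackingStore.put(k, value);
        return null;
      }
      // Either the key is already cached or there is room: keep the value in the cache.
      return value;
    });
  }
}
```

Returning null from the compute function for an absent key leaves the map unchanged, which is what keeps a full cache from growing past its maximum size.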

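That is the main content of this article. Most of it covered the implementation of Alluxio's internal cache; for more detail, readers can follow the links to the relevant classes below.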

References


[1].https://github.com/google/guava/wiki/CachesExplained#refresh
[2].https://dzone.com/articles/scalable-metadata-service-in-alluxio-storing-billi
[3].https://dzone.com/articles/store-1-billion-files-in-alluxio-20
[4].https://github.com/Alluxio/alluxio/blob/master/core/server/master/src/main/java/alluxio/master/metastore/caching/CachingInodeStore.java
[5].https://github.com/Alluxio/alluxio/blob/master/core/server/master/src/main/java/alluxio/master/metastore/caching/Cache.java
