Hive metastore server: Failed to sync requested HMS notifications up to the event ID xxxxx

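Recently, Hive create and drop table statements became slow while all other statements ran normally: operations that usually finish within seconds now needed 200 s. Investigation showed that a user had used a phone number as a dynamic partition column, so over a million partitions were written at once. This caused trouble in the hive metastore server: its thread count spiked to more than 1k and its memory usage rose. After the job was stopped, the thread count and memory dropped back to normal levels, but create and drop table were still abnormal and took 200 s to complete.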

Sentry logged the following warning:

timed out wait request for id  xxxx

The hive metastore server logged the following error:

Failed to sync requested HMS notifications up to the event ID xxxxx

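Reading the source of Sentry's CounterWait class, where the exception originates, showed that the id passed in was larger than currentId, so the call kept waiting until it timed out. The timeout is 200 s.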

public long waitFor(long value) throws InterruptedException, TimeoutException {
    // Fast path - counter value already reached, no need to block
    if (value <= currentId.get()) {
      return currentId.get();
    }

    // Enqueue the waiter for this value
    ValueEvent eid = new ValueEvent(value);
    waiters.put(eid);

    // It is possible that between the fast path check and the time the
    // value event is enqueued, the counter value already reached the requested
    // value. In this case we return immediately.
    if (value <= currentId.get()) {
      return currentId.get();
    }

    // At this point we may be sure that by the time the event was enqueued,
    // the counter was below the requested value. This means that update()
    // is guaranteed to wake us up when the counter reaches the requested value.
    // The wake up may actually happen before we start waiting, in this case
    // the event's blocking queue will be non-empty and the waitFor() below
    // will not block, so it is safe to wake up before the wait.
    // So sit tight and wait patiently.
    eid.waitFor();
    LOGGER.debug("CounterWait added new value to waitFor: value = {}, currentId = {}", value, currentId.get());
    return currentId.get();
  }
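To make the failure mode concrete, here is a minimal sketch of a counter wait with a timeout. It is a simplified model written for illustration, not Sentry's actual CounterWait: the class name CounterWaitSketch, the lock-and-condition implementation, and the millisecond timeout used in main are all assumptions. It only shows why a requested id that stays ahead of currentId makes every caller block for the full timeout.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified model of "wait until the counter reaches a value".
// NOT Sentry's CounterWait; it only mimics the observable behaviour.
class CounterWaitSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition advanced = lock.newCondition();
    private final long timeoutMs;
    private long currentId;

    CounterWaitSketch(long initialId, long timeoutMs) {
        this.currentId = initialId;
        this.timeoutMs = timeoutMs;
    }

    // Called by the (assumed) background thread as it processes notifications.
    void update(long newId) {
        lock.lock();
        try {
            if (newId > currentId) {
                currentId = newId;
                advanced.signalAll(); // wake every waiter so it can re-check its target
            }
        } finally {
            lock.unlock();
        }
    }

    // Blocks until currentId >= value, or throws once timeoutMs has elapsed.
    long waitFor(long value) throws InterruptedException, TimeoutException {
        lock.lock();
        try {
            long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            while (currentId < value) {
                if (remainingNanos <= 0L) {
                    throw new TimeoutException(
                        "timed out waiting for id " + value + ", currentId = " + currentId);
                }
                // awaitNanos returns an estimate of the time still left to wait.
                remainingNanos = advanced.awaitNanos(remainingNanos);
            }
            return currentId;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        // 200 ms here stands in for the 200 s timeout described in the post.
        CounterWaitSketch counter = new CounterWaitSketch(100L, 200L);
        System.out.println(counter.waitFor(90L)); // fast path: already reached
        try {
            counter.waitFor(150L); // nobody advances the counter, so this times out
        } catch (TimeoutException e) {
            System.out.println(e.getMessage());
        }
        counter.update(150L); // the backlog has been processed
        System.out.println(counter.waitFor(150L)); // returns immediately now
    }
}
```

In this model, as long as nothing calls update() with a large enough id, each waitFor() call pays the full timeout, which matches the constant 200 s delay seen on create and drop table.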

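This code handles the permission-synchronization messages exchanged among HDFS, Sentry, and the hive metastore server once hdfs-sentry ACL sync is enabled. When a large batch of directory permission messages suddenly arrives, the background thread cannot keep up, the messages back up, and this exception appears. The exception does not affect cluster availability; it only makes create and drop table slow, because each one has to wait 200 s. The wait exists so that Sentry can catch up to the latest id. You can lower the Sentry parameter sentry.notification.sync.timeout.ms (default 200 s) to shorten the wait, and if the backlog is small you can simply let it be consumed on its own.

 In our case, the hive metastore server thread that takes part in processing the sync messages had also exited abnormally, so the data in Sentry's sentry_hms_notification_id table was never updated, and the hive metastore server had to be restarted.

 If too many messages have piled up, consuming them gradually takes too long and may never catch up; in that case you can choose to discard them. To do so, insert a row into Sentry's sentry_hms_notification_id table holding the maximum value (equal to the current message id, taken from the notification_sequence table), then restart the Sentry service. The notification_log table stores the message log records.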
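The choice between letting the backlog drain and discarding it can be roughed out with simple arithmetic. The sketch below is a hypothetical helper, not part of Sentry or Hive: the class and method names are invented, and the consume and produce rates are assumptions you would have to measure yourself. The two ids correspond to the current message id from notification_sequence and the last id recorded in sentry_hms_notification_id.

```java
// Hypothetical helper: names and the rate-based estimate are illustrative assumptions.
class BacklogEstimate {
    // Number of HMS notification events that Sentry has not yet processed.
    static long backlog(long latestHmsEventId, long sentryProcessedId) {
        return Math.max(0L, latestHmsEventId - sentryProcessedId);
    }

    // Estimated seconds until Sentry catches up, or -1 if it never will
    // because new events arrive at least as fast as they are consumed.
    static long secondsToCatchUp(long backlog, double consumedPerSec, double producedPerSec) {
        if (backlog == 0L) {
            return 0L;
        }
        double netRate = consumedPerSec - producedPerSec;
        if (netRate <= 0.0) {
            return -1L;
        }
        return (long) Math.ceil(backlog / netRate);
    }

    public static void main(String[] args) {
        // Example numbers only; they are not measurements from the incident.
        long pending = backlog(1_500_000L, 500_000L);
        long seconds = secondsToCatchUp(pending, 150.0, 50.0);
        System.out.println("pending events: " + pending);
        System.out.println("estimated seconds to catch up: " + seconds);
    }
}
```

If the estimate is a few minutes, waiting is probably fine; if it is many hours or -1, discarding the backlog as described above is the more practical option.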
