hive metastore server Failed to sync requested HMS notifications up to the event ID xxxxx

Original

2020-02-25 00:58

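Recently, create and drop table statements in Hive became slow while other statements ran normally: they usually finish within a second, but now each took 200 s. Investigation showed that a user had used a phone number as the dynamic partition column, so on the order of a million partitions were written at once. The hive metastore server ran into trouble: its thread count spiked to over 1k and its memory usage climbed. After the job was stopped, the thread count and memory dropped back to normal levels, but create and drop table were still abnormal and took 200 s to complete.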

Sentry logged the following warning:

timed out wait request for id xxxx

The hive metastore server logged the following error:

Failed to sync requested HMS notifications up to the event ID xxxxx

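Looking at the source of CounterWait, the Sentry class behind the warning, showed that the id passed in was larger than currentId, so the call kept waiting until it timed out. The timeout is 200 s.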

  public long waitFor(long value) throws InterruptedException, TimeoutException {
    // Fast path - counter value already reached, no need to block
    if (value <= currentId.get()) {
      return currentId.get();
    }

    // Enqueue the waiter for this value
    ValueEvent eid = new ValueEvent(value);
    waiters.put(eid);

    // It is possible that between the fast path check and the time the
    // value event is enqueued, the counter value already reached the requested
    // value. In this case we return immediately.
    if (value <= currentId.get()) {
      return currentId.get();
    }

    // At this point we may be sure that by the time the event was enqueued,
    // the counter was below the requested value. This means that update()
    // is guaranteed to wake us up when the counter reaches the requested value.
    // The wake up may actually happen before we start waiting, in this case
    // the event's blocking queue will be non-empty and the waitFor() below
    // will not block, so it is safe to wake up before the wait.
    // So sit tight and wait patiently.
    eid.waitFor();
    LOGGER.debug("CounterWait added new value to waitFor: value = {}, currentId = {}", value, currentId.get());
    return currentId.get();
  }
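To make the timeout easier to see, here is a minimal sketch of the same wait-for-id idea. It is not Sentry's CounterWait: the class name SimpleCounterWait, its constructor, and the 200 ms value are assumptions made for illustration, and it uses a plain monitor instead of Sentry's queue of waiters. It only shows that a request for an id above the current one blocks until the counter advances or the timeout elapses.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Simplified stand-in for the wait-for-id idea; NOT Sentry's CounterWait. */
class SimpleCounterWait {
    private final long timeoutMillis; // plays the role of the 200 s sync timeout
    private long currentId = 0;       // last event id the simulated processor has handled

    SimpleCounterWait(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called by the processing thread after it has handled events up to newId. */
    synchronized void update(long newId) {
        if (newId > currentId) {
            currentId = newId;
            notifyAll(); // wake every waiter so it can re-check its target
        }
    }

    /** Blocks until currentId >= value, or throws once the timeout elapses. */
    synchronized long waitFor(long value) throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (currentId < value) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                throw new TimeoutException(
                    "timed out waiting for id " + value + ", currentId = " + currentId);
            }
            // Releases the monitor while waiting; update() wakes us via notifyAll().
            TimeUnit.NANOSECONDS.timedWait(this, remaining);
        }
        return currentId;
    }

    public static void main(String[] args) throws Exception {
        SimpleCounterWait counter = new SimpleCounterWait(200); // 200 ms here, not 200 s
        counter.update(100);
        System.out.println("fast path returned " + counter.waitFor(90));
        try {
            // Nobody advances the counter, as when the processing thread is stuck.
            counter.waitFor(150);
        } catch (TimeoutException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the second call corresponds to the slow create and drop table statements: the requested id is ahead of what has been processed, so the caller pays the full timeout before giving up.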

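This code is part of the message handling that keeps permissions in sync among HDFS, Sentry, and the hive metastore server once HDFS-Sentry ACL sync is enabled. When a large burst of directory permission messages suddenly needs processing, the background thread cannot keep up, the messages pile up and lag behind, and this exception appears. The exception does not affect use of the cluster; it only makes create and drop table slow, with each one waiting 200 s. The wait exists so that Sentry can catch up to the latest id. You can lower the Sentry parameter sentry.notification.sync.timeout.ms (default 200 s) to shorten the wait, and if the backlog is small you can let it be consumed on its own.

In our case, the hive metastore server thread that takes part in processing the sync messages had also exited abnormally, so the data in Sentry's sentry_hms_notification_id table was never updated, and the hive metastore server had to be restarted.

If too many messages have piled up, letting them be consumed gradually takes too long and may never catch up. In that case you can choose to discard them: insert a row into Sentry's sentry_hms_notification_id table holding the maximum value (equal to the current message id, taken from the notification_sequence table), then restart the Sentry service. The notification_log table stores the message log.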
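The choice between waiting for the backlog to drain and discarding it comes down to simple arithmetic. The sketch below is only an illustration of that reasoning: the class BacklogEstimate, the method secondsToCatchUp, and the rates in main are hypothetical and are not part of Sentry or Hive. The two ids stand for the current id from notification_sequence and the last id recorded in sentry_hms_notification_id; the processing and arrival rates would have to be measured on the cluster.

```java
import java.util.OptionalDouble;

/** Hypothetical helper for estimating how long a notification backlog takes to drain. */
final class BacklogEstimate {

    /**
     * @param latestEventId      newest HMS event id (assumed to be read from notification_sequence)
     * @param lastProcessedId    last id Sentry recorded (assumed to be read from sentry_hms_notification_id)
     * @param processedPerSecond measured rate at which Sentry consumes events
     * @param arrivingPerSecond  measured rate at which new events are produced
     * @return estimated seconds until caught up, or empty if the backlog never shrinks
     */
    static OptionalDouble secondsToCatchUp(long latestEventId, long lastProcessedId,
                                           double processedPerSecond, double arrivingPerSecond) {
        long backlog = Math.max(0L, latestEventId - lastProcessedId);
        if (backlog == 0) {
            return OptionalDouble.of(0.0); // already caught up
        }
        double netRate = processedPerSecond - arrivingPerSecond;
        if (netRate <= 0) {
            return OptionalDouble.empty(); // consumer is not gaining on the producer
        }
        return OptionalDouble.of(backlog / netRate);
    }

    public static void main(String[] args) {
        // Made-up numbers: one million events behind, 500 handled/s, 100 arriving/s.
        OptionalDouble eta = secondsToCatchUp(1_000_000L, 0L, 500.0, 100.0);
        System.out.println(eta.isPresent()
            ? "estimated catch-up time in seconds: " + eta.getAsDouble()
            : "backlog will not drain at these rates; consider discarding it");
    }
}
```

An empty result, or an estimate far longer than is acceptable, points toward the discard procedure described above; a short estimate suggests letting the backlog drain, possibly with a lower sync timeout in the meantime.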

