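Recently, create and drop table statements in Hive started running slowly while other statements were unaffected: operations that normally finish within a second now took 200 s. Investigation showed that a user had used a phone number column as the dynamic partition field, so a single job wrote more than a million partitions at once. This overloaded the Hive metastore server: its thread count spiked to over 1k and its memory usage rose. After the job was stopped, the thread count and memory dropped back to normal levels, but create and drop table still took 200 s to complete.
Sentry logged the following warning: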
timed out wait request for id xxxx
The Hive metastore server logged the following error:
Failed to sync requested HMS notifications up to the event ID xxxxx
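The source of Sentry's CounterWait class, where the exception originates, shows that the id passed in is greater than currentId, so the call keeps waiting until it times out. The timeout is 200 s.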
public long waitFor(long value) throws InterruptedException, TimeoutException {
  // Fast path - counter value already reached, no need to block
  if (value <= currentId.get()) {
    return currentId.get();
  }

  // Enqueue the waiter for this value
  ValueEvent eid = new ValueEvent(value);
  waiters.put(eid);

  // It is possible that between the fast path check and the time the
  // value event is enqueued, the counter value already reached the requested
  // value. In this case we return immediately.
  if (value <= currentId.get()) {
    return currentId.get();
  }

  // At this point we may be sure that by the time the event was enqueued,
  // the counter was below the requested value. This means that update()
  // is guaranteed to wake us up when the counter reaches the requested value.
  // The wake up may actually happen before we start waiting, in this case
  // the event's blocking queue will be non-empty and the waitFor() below
  // will not block, so it is safe to wake up before the wait.
  // So sit tight and wait patiently.
  eid.waitFor();
  LOGGER.debug("CounterWait added new value to waitFor: value = {}, currentId = {}", value, currentId.get());
  return currentId.get();
}
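To make the blocking behaviour of the waitFor method above easier to see, here is a minimal sketch of the same idea. It is not Sentry's implementation: the class name SimpleCounterWait and its constructor are hypothetical, and it uses a single lock and condition instead of Sentry's per-waiter events. It only illustrates that a caller asking for an id above the current counter blocks until either update() catches up or the timeout expires.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical, simplified stand-in for a counter wait; not the real Sentry class. */
class SimpleCounterWait {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition advanced = lock.newCondition();
    private final long timeoutMs; // plays the role of the sync timeout (assumption)
    private long currentId;

    SimpleCounterWait(long initialId, long timeoutMs) {
        this.currentId = initialId;
        this.timeoutMs = timeoutMs;
    }

    /** Called by the (hypothetical) background thread once it has processed up to newId. */
    void update(long newId) {
        lock.lock();
        try {
            if (newId > currentId) {
                currentId = newId;
                advanced.signalAll(); // wake every waiter so it can re-check its target
            }
        } finally {
            lock.unlock();
        }
    }

    /** Blocks until the counter reaches value, or throws TimeoutException after timeoutMs. */
    long waitFor(long value) throws InterruptedException, TimeoutException {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            // Returns immediately when the counter is already at or past the requested value.
            while (currentId < value) {
                if (remainingNanos <= 0) {
                    throw new TimeoutException("timed out waiting for id " + value);
                }
                remainingNanos = advanced.awaitNanos(remainingNanos);
            }
            return currentId;
        } finally {
            lock.unlock();
        }
    }
}
```

With a counter created as new SimpleCounterWait(5, 50), waitFor(3) returns at once, while waitFor(10) blocks for the full 50 ms and then throws TimeoutException unless another thread calls update(10) in the meantime. In the incident described here the counter stayed behind the requested id, so every create or drop table call waited out the full timeout.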
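This code is part of the message handling that synchronizes permissions between HDFS, Sentry, and the Hive metastore server once HDFS-Sentry ACL sync is enabled. When a large burst of directory permission messages arrives at once, the background thread cannot keep up, the messages pile up and lag behind, and this exception appears. The exception does not affect the cluster's usability. It only makes create and drop table slow, each waiting 200 s, and that wait exists so that Sentry can catch up to the latest id. The wait can be shortened by lowering Sentry's sentry.notification.sync.timeout.ms parameter (default 200 s), and if the backlog is small it can simply be left to drain on its own.
In our case there was a second problem: the Hive metastore server thread that takes part in processing the sync messages had exited abnormally, so Sentry's sentry_hms_notification_id table was no longer being updated. Fixing that required restarting the Hive metastore server.
If too many messages have piled up, letting them be consumed gradually takes too long and may never catch up. In that case you can choose to discard them: insert a row holding the maximum value into Sentry's sentry_hms_notification_id table (equal to the current message id, obtained from the notification_sequence table), then restart the Sentry service. The notification_log table stores the message log.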
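The choice between letting the backlog drain and discarding it can be framed as simple arithmetic. The sketch below is purely illustrative: BacklogEstimate and its methods are hypothetical names, not part of Sentry or Hive, and the two ids and the processing rate are values you would have to read or measure yourself (for example the current id from notification_sequence and the last id recorded in sentry_hms_notification_id).

```java
/** Hypothetical helper for reasoning about the notification backlog; not part of Sentry or Hive. */
final class BacklogEstimate {
    private BacklogEstimate() {}

    /** Number of HMS events Sentry has not processed yet (never negative). */
    static long backlog(long latestHmsEventId, long lastSentryProcessedId) {
        return Math.max(0L, latestHmsEventId - lastSentryProcessedId);
    }

    /** Estimated seconds needed to drain the backlog; Long.MAX_VALUE if nothing is being processed. */
    static long secondsToCatchUp(long backlog, double eventsPerSecond) {
        if (backlog == 0) {
            return 0L;
        }
        if (eventsPerSecond <= 0) {
            return Long.MAX_VALUE; // e.g. the processing thread has died
        }
        return (long) Math.ceil(backlog / eventsPerSecond);
    }

    /** True when draining would take longer than the operator is willing to wait. */
    static boolean shouldSkip(long backlog, double eventsPerSecond, long maxAcceptableSeconds) {
        return secondsToCatchUp(backlog, eventsPerSecond) > maxAcceptableSeconds;
    }
}
```

For instance, with a backlog of 800,000 events and an assumed rate of 100 events per second, draining would take about 8,000 s, so an operator who can only tolerate an hour of slow DDL might prefer to skip ahead. The rate and the tolerance here are made-up numbers for illustration, not measurements from this incident.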