[Big Data Tech Stack] Exploring the Unknown Through a Yarn OOM

I really must be one of heaven's favourites: on my watch, Yarn's virtual memory actually blew up. Yes, it blew up. This is only a test cluster, the data volume is small, and the memory requested per run is small too, yet the virtual memory still ran out. Here is the case analysis.

Incident replay

Here is how it happened:

The project needed a different scheduler: Yarn's original Capacity Scheduler no longer suited it, so I switched to the Fair Scheduler. The configured result is shown in the screenshot below:

(Screenshot: the Yarn web UI showing the Fair Scheduler in effect)
That showed the scheduler configuration itself was fine.
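For reference, switching schedulers normally comes down to one property in yarn-site.xml. A minimal sketch, assuming Hadoop 2.7.x defaults (the allocation-file path here is hypothetical; point it at your own queue definition):

<!-- yarn-site.xml: a minimal sketch for enabling the Fair Scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

<property>
  <!-- hypothetical location of the queue/allocation definition -->
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>/opt/module/hadoop-2.7.2/etc/hadoop/fair-scheduler.xml</value>
</property>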

Now it was time to bring up my Flink-on-Yarn session:

./yarn-session.sh \
	-n 3 \
	-s 6 \
	-jm 256 \
	-tm 1024 \
	-nm "flink on yarn"
	-d

Then the problem appeared:
(Screenshot: yarn-session.sh aborting with a deployment-timeout error)

The message says the Flink cluster deployment took more than 60 seconds and asks us to check whether the requested resources are actually available on the Yarn cluster. In other words: your Yarn cluster is broken, go find the cause yourself. And finding the cause means going through the log files.
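If you are not sure which file to open: the container-monitoring messages come from the NodeManager log. A rough sketch, with paths assumed from the /opt/module/hadoop-2.7.2 install that shows up in the output below:

# NodeManager log files follow the pattern yarn-<user>-nodemanager-<host>.log
cd /opt/module/hadoop-2.7.2/logs
# filter the container memory-monitor and disk-health messages
grep -E "ContainersMonitorImpl|DirectoryCollection|local-dirs" yarn-*-nodemanager-*.log | tail -n 40

In my case the relevant stretch of the NodeManager log looked like this: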

325.2 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used
2020-04-01 15:25:37,410 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 209.6 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used
2020-04-01 15:25:40,419 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.3 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:40,427 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 340.0 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:43,450 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.4 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:43,481 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 340.1 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:46,503 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.4 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:46,526 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 334.9 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:49,545 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.4 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:49,586 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 334.8 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:52,607 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.6 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:52,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 334.9 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:53,040 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /opt/module/hadoop-2.7.2/data/tmp/nm-local-dir error, used space above threshold of 90.0%, removing from list of valid directories
2020-04-01 15:25:53,040 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /opt/module/hadoop-2.7.2/logs/userlogs error, used space above threshold of 90.0%, removing from list of valid directories
2020-04-01 15:25:53,040 INFO org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Disk(s) failed: 1/1 local-dirs are bad: /opt/module/hadoop-2.7.2/data/tmp/nm-local-dir; 1/1 log-dirs are bad: /opt/module/hadoop-2.7.2/logs/userlogs
2020-04-01 15:25:53,040 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs are bad: /opt/module/hadoop-2.7.2/data/tmp/nm-local-dir; 1/1 log-dirs are bad: /opt/module/hadoop-2.7.2/logs/userlogs

Within all that, the lines I zeroed in on are these (highlighted as a screenshot in the original post):

336.4 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
What it says is: one of our containers, container_1585725830038_0003_02_000001, is using 336.4 MB of its 1 GB of physical memory and 2.3 GB of its 2.1 GB of virtual memory (the format being used / total).

Clearly the virtual memory is the problem: the limit is 2.1 GB, so where did 2.3 GB come from? That is exactly an OOM. (The 2.1 GB ceiling itself is no accident: it is the container's 1 GB physical allocation multiplied by the default yarn.nodemanager.vmem-pmem-ratio of 2.1.)

Yarn's virtual memory

The official documentation provides several configuration parameters related to Yarn's virtual memory (note that the two *-vcores parameters below concern virtual CPU cores, not virtual memory):

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value>
  <description>Whether virtual memory limits will be enforced for containers.</description>
</property>

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
  <description>Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.</description>
</property>

<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
  <description>The minimum allocation for every container request at the RM in terms of virtual CPU cores. Requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have fewer virtual cores than this value will be shut down by the resource manager.</description>
</property>

<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>4</value>
  <description>The maximum allocation for every container request at the RM in terms of virtual CPU cores. Requests higher than this will throw an InvalidResourceRequestException.</description>
</property>

<property>
  <name>yarn.nodemanager.elastic-memory-control.enabled</name>
  <value>false</value>
  <description>Enable elastic memory control. This is a Linux only feature. When enabled, the node manager adds a listener to receive an event, if all the containers exceeded a limit. The limit is specified by yarn.nodemanager.resource.memory-mb. If this is not set, the limit is set based on the capabilities. See yarn.nodemanager.resource.detect-hardware-capabilities for details. The limit applies to the physical or virtual (rss+swap) memory depending on whether yarn.nodemanager.pmem-check-enabled or yarn.nodemanager.vmem-check-enabled is set.</description>
</property>

What is virtual memory?

Roughly speaking, virtual memory is disk space "requisitioned" to act as extra RAM: when physical memory runs short, the OS backs part of a process's address space with swap on disk. What Yarn's monitor tracks is the total virtual memory size of a container's whole process tree.

# Inspect a process's virtual memory usage
pmap -x <pid>
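On Linux you can also cross-check the same numbers straight from the kernel (not a Yarn-specific tool, just the raw process accounting):

# VmSize is the virtual size, VmRSS the resident (physical) size of the process
grep -E 'VmSize|VmRSS' /proc/<pid>/status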

Solution

The limit is too small, so enlarge it: give the container more physical memory, raise yarn.nodemanager.vmem-pmem-ratio, or switch off the virtual-memory check entirely. Whether that is the right fix, who knows; only by trying will the real cause reveal itself. But do you think the problem is solved at this point? I can tell you: no. The main problem is not this one; this is just a small problem inside a bigger one. The OOM itself is fixed, and now I am off to deal with the bigger problem.
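Concretely, on a small test cluster like this one I would try one of the following in yarn-site.xml and then restart the NodeManagers. This is a sketch of the two obvious knobs, not a recommendation for production:

<!-- Option 1: give containers more virtual-memory headroom (4 is an arbitrary example value) -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>

<!-- Option 2 (blunter): stop enforcing the virtual-memory limit altogether -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>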

One more note: if you hit the timeout shown above, a Yarn OOM is one possible cause, but the real reason has to come from the log files, specifically the logs under logs/userlogs/application_xxx. The meaning of life lies in exploring the unknown.
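A quick way to pull those application logs, assuming log aggregation is enabled (the application id here is read off the container id in the output above):

# application id derived from container_1585725830038_0003_02_000001
yarn logs -applicationId application_1585725830038_0003
# or browse the raw per-application directory on the NodeManager host
ls /opt/module/hadoop-2.7.2/logs/userlogs/application_1585725830038_0003/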
