[Arm Linux] The cpuidle menu governor

  • Concepts and ideas behind the menu governor
  • For the menu governor, there are 3 decision factors for picking a C state:
    1. Energy break even point
    2. Performance impact
    3. Latency tolerance (from pmqos infrastructure)
  • These three factors are treated independently.
  • Energy break even point

  • C state entry and exit have an energy cost, and a certain amount of time in the C state is required to actually break even on this cost. CPUIDLE provides us this duration in the “target_residency” field. So all that we need is a good prediction of how long we’ll be idle. Like the traditional menu governor, we start with the actual known “next timer event” time.
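As a rough illustration of the break-even rule, the sketch below picks the deepest state whose target residency fits inside the predicted idle time. The `CState` class, the `STATES` table, and all of its numbers are hypothetical; they are not taken from any real platform or from the kernel source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CState:
    name: str
    target_residency_us: int  # minimum idle time needed to break even on energy


# Hypothetical state table, ordered from shallowest to deepest.
STATES = [
    CState("C0-poll", 0),
    CState("C1", 2),
    CState("C2", 100),
    CState("C3", 1000),
]


def deepest_break_even_state(predicted_idle_us, states=STATES):
    """Return the deepest state whose target residency fits the prediction."""
    chosen = states[0]
    for state in states:
        if state.target_residency_us <= predicted_idle_us:
            chosen = state
    return chosen
```

With this table, a predicted idle time of 500 microseconds is long enough to pay back C2 but not C3, so C2 is chosen.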

  • Since there are other sources of wakeups (interrupts, for example) than the next timer event, this estimate is rather optimistic. To get a more realistic estimate, a correction factor based on historic behavior is applied to it. For example, if in the past the actual duration always was 50% of the next timer tick, the correction factor will be 0.5.
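The arithmetic of the correction factor can be sketched as follows. Both function names are hypothetical, and capping the ratio at 1.0 is an assumption made here so that a late wakeup never inflates the estimate.

```python
def observed_ratio(actual_idle_us, next_timer_us):
    """Ratio of the actual idle time to the next-timer estimate, capped at 1.0."""
    if next_timer_us <= 0:
        return 1.0
    return min(actual_idle_us / next_timer_us, 1.0)


def corrected_prediction_us(next_timer_us, correction_factor):
    """Scale the optimistic next-timer estimate by the historical ratio."""
    return next_timer_us * correction_factor
```

If the CPU was woken after 500 microseconds while the next timer was 1000 microseconds away, the observed ratio is 0.5, and a later 1000 microsecond timer would be predicted as 500 microseconds of idle time.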

  • menu uses a running average for this correction factor, but it keeps a set of factors rather than a single one. This stems from the realization that the ratio depends on the order of magnitude of the expected duration: if we expect 500 milliseconds of idle time, the likelihood of getting an interrupt very early is much higher than if we expect 50 microseconds of idle time. A second, independent factor that has a big impact on the actual factor is whether there is (disk) IO outstanding or not.

  • (as a special twist, we consider every sleep longer than 50 milliseconds as perfect; there are no power gains for sleeping longer than this)

  • For these two reasons we keep an array of 12 independent factors, indexed by the magnitude of the expected duration as well as the “is IO outstanding” property.
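One way to picture the array of 12 factors is as 6 magnitude levels times 2 IO states. The sketch below is written under that assumption: the decade thresholds, the `DECAY` weight, and the class and function names are all illustrative choices, not values confirmed by the text above.

```python
BUCKETS = 12               # 6 magnitude levels x 2 IO states (assumed split)
DECAY = 8                  # assumed weight of history in the running average
PERFECT_SLEEP_US = 50_000  # sleeps longer than 50 ms count as perfect


def which_bucket(expected_us, io_outstanding):
    """Map an expected duration and the IO flag to an index in 0..11."""
    base = BUCKETS // 2 if io_outstanding else 0
    # Assumed decade thresholds: <10us, <100us, <1ms, <10ms, <100ms, longer.
    for level, limit in enumerate((10, 100, 1_000, 10_000, 100_000)):
        if expected_us < limit:
            return base + level
    return base + 5


class CorrectionFactors:
    def __init__(self):
        self.factors = [1.0] * BUCKETS  # start by trusting the timer fully

    def predict_us(self, next_timer_us, io_outstanding):
        """Apply the bucket's correction factor to the next-timer estimate."""
        index = which_bucket(next_timer_us, io_outstanding)
        return next_timer_us * self.factors[index]

    def update(self, next_timer_us, io_outstanding, measured_us):
        """Fold one observed idle period into the bucket's running average."""
        if next_timer_us <= 0:
            return
        if measured_us > PERFECT_SLEEP_US:
            ratio = 1.0  # the "special twist": long sleeps count as perfect
        else:
            ratio = min(measured_us / next_timer_us, 1.0)
        index = which_bucket(next_timer_us, io_outstanding)
        self.factors[index] += (ratio - self.factors[index]) / DECAY
```

Because each bucket is updated separately, an early interrupt during a long expected sleep does not distort the factor used for short sleeps, and periods with IO outstanding are tracked apart from periods without it.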

  • Repeatable-interval-detector


  • There are some cases where “next timer” is a completely unusable predictor: those where the interval is fixed, for example due to hardware interrupt mitigation, but also due to fixed-transfer-rate devices such as mice.

  • For this, we use a different predictor: we track the duration of the last 8 intervals, and if the standard deviation of these 8 intervals is below a threshold value, we use the average of these intervals as the prediction.
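The detector described above can be sketched in a few lines. The threshold value is hypothetical, since the text does not state one, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev

INTERVALS = 8             # number of recent idle intervals tracked, per the text
STDDEV_THRESHOLD_US = 20  # hypothetical threshold; the text gives no value


def repeatable_interval_us(recent_intervals_us):
    """Return the average of the last 8 intervals if they are stable, else None."""
    if len(recent_intervals_us) < INTERVALS:
        return None
    window = recent_intervals_us[-INTERVALS:]
    if pstdev(window) < STDDEV_THRESHOLD_US:
        return mean(window)
    return None
```

A return value of `None` means no repeating pattern was found, in which case the caller would fall back to the corrected next-timer estimate.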

  • Limiting Performance Impact


  • C states, especially those with large exit latencies, can have a noticeable impact on workloads, which is not acceptable for most sysadmins. In addition, lower performance has a power price of its own.

  • As a general rule of thumb, menu assumes that the following heuristic holds:

  • The busier the system, the less impact of C states is acceptable

  • This rule-of-thumb is implemented using a performance-multiplier: If the exit latency times the performance multiplier is longer than the predicted duration, the C state is not considered a candidate for selection due to a too high performance impact. So the higher this multiplier is, the longer we need to be idle to pick a deep C state, and thus the less likely a busy CPU will hit such a deep C state.
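The performance check itself reduces to a single comparison. The function name below is hypothetical; the condition follows the rule as stated above.

```python
def passes_performance_check(exit_latency_us, multiplier, predicted_idle_us):
    """A state stays a candidate only if latency * multiplier fits in the idle time."""
    return exit_latency_us * multiplier <= predicted_idle_us
```

For a predicted idle time of 1000 microseconds and a multiplier of 10, a state with a 50 microsecond exit latency passes (500 <= 1000), while one with a 300 microsecond exit latency is rejected (3000 > 1000).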

  • Two factors are used in determining this multiplier:

  • a value of 10 is added for each point of “per cpu load average” we have.

  • a value of 5 points is added for each process that is waiting for IO on this CPU.

  • (these values are experimentally determined)

  • The load average factor gives a longer-term (few seconds) input to the decision, while the iowait value gives a CPU-local, instantaneous input.

  • The iowait factor may look low, but realize that this is also already represented in the system load average.
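Putting the two inputs together, the multiplier could be computed as in the sketch below. The weights of 10 and 5 come from the text; the base value of 1 and the function name are assumptions made here so that a fully idle system still gets a usable multiplier.

```python
def performance_multiplier(load_average, nr_iowaiters):
    """Combine the per-CPU load average and the iowait count into one multiplier."""
    mult = 1                   # assumed base value, not stated in the text
    mult += 10 * load_average  # 10 per point of per-CPU load average
    mult += 5 * nr_iowaiters   # 5 per process waiting for IO on this CPU
    return mult
```

With a load average of 2 and one process waiting for IO, the multiplier is 26, so a state with a 100 microsecond exit latency would need more than 2600 microseconds of predicted idle time to remain a candidate.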

