New Flink Features: the Network Buffer Debloating Mechanism

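Preface

I have been preparing an internal talk on the new features in Flink 1.13 / 1.14, so I am taking some notes along the way.

Network Buffers, Revisited

It has been a while since I last wrote about Flink's network stack, but the concept of a network buffer should be familiar. It is the smallest unit of data exchange in Flink's network layer, carries serialized data, and is allocated from direct memory. The size of one Buffer equals the size of one MemorySegment, i.e. taskmanager.memory.segment-size, which defaults to 32 KB.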

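The figure above shows how Buffers (solid black boxes) travel between two TaskManagers, where:

RP = ResultPartition / RS = ResultSubpartition
IG = InputGate / IC = InputChannel
LocalBufferPool / NetworkBufferPool are not shown

The number of Buffers is derived as follows:

On the sending side, a ResultPartition is allocated (number of ResultSubpartitions + 1) Buffers in total. To guard against skew, no single ResultSubpartition may hold more than taskmanager.network.memory.max-buffers-per-channel Buffers (default 10).
On the receiving side, each InputChannel owns taskmanager.network.memory.buffers-per-channel exclusive Buffers (default 2), and the InputGate can additionally provide taskmanager.network.memory.floating-buffers-per-gate floating Buffers (default 8).

In other words, once a job's ExecutionGraph is fixed, these rules, together with the DistributionPattern between tasks (POINTWISE / ALL_TO_ALL), let us estimate how many network buffers the job needs.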
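The counting rules above can be turned into a quick estimate. The following is only a minimal sketch: the class and method names are made up for illustration, and the constants simply hard-code the default values quoted in this post.

```java
/** Back-of-the-envelope estimate of network buffers per task (illustrative names, not Flink API). */
final class BufferEstimate {
    static final int BUFFERS_PER_CHANNEL = 2;        // taskmanager.network.memory.buffers-per-channel
    static final int FLOATING_BUFFERS_PER_GATE = 8;  // taskmanager.network.memory.floating-buffers-per-gate
    static final int SEGMENT_SIZE_BYTES = 32 * 1024; // taskmanager.memory.segment-size

    /** Sending side: one Buffer per ResultSubpartition, plus one. */
    static int senderBuffers(int subpartitions) {
        return subpartitions + 1;
    }

    /** Receiving side: exclusive Buffers for every InputChannel plus the gate's floating Buffers. */
    static int receiverBuffers(int inputChannels) {
        return inputChannels * BUFFERS_PER_CHANNEL + FLOATING_BUFFERS_PER_GATE;
    }

    /** Memory taken by the given number of full-size Buffers. */
    static long bytes(int buffers) {
        return (long) buffers * SEGMENT_SIZE_BYTES;
    }

    public static void main(String[] args) {
        // ALL_TO_ALL edge between two operators with parallelism 100: every upstream
        // subtask has 100 subpartitions, every downstream gate has 100 channels.
        int sender = senderBuffers(100);
        int receiver = receiverBuffers(100);
        System.out.println(sender + " sender buffers, " + bytes(sender) + " bytes");
        System.out.println(receiver + " receiver buffers, " + bytes(receiver) + " bytes");
    }
}
```

With a POINTWISE edge the number of subpartitions and channels per subtask is much smaller, so the same two methods give a correspondingly smaller estimate.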

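What's the Problem?

In most cases, apart from tuning taskmanager.memory.network.fraction, we do not need to worry about any other aspect of network buffers.

However, the network buffer configuration is static. If a large amount of data is buffered (especially in jobs with high parallelism), memory is wasted and checkpointing suffers: with traditional aligned checkpoints, a barrier takes longer to align because it has to travel behind more in-flight data; with unaligned checkpoints, there is more in-flight data to snapshot. Network Buffer Debloating was introduced to keep in-flight data to a minimum.

As the figure below from FLIP-183 shows, the Buffer size on the receiving side is no longer fixed; it is determined dynamically from a preset target consumption time and the throughput measured over a recent time window.

The number of Buffers, of course, stays fixed. After debloating, a Buffer can shrink to as little as taskmanager.memory.min-segment-size (default 1 KB).

Buffer Debloating Settings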

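taskmanager.network.memory.buffer-debloat.enabled
Whether network buffer debloating is enabled. Defaults to false.
taskmanager.network.memory.buffer-debloat.period
The interval at which the debloating operation is scheduled. Defaults to 200 ms.
taskmanager.network.memory.buffer-debloat.samples
The number of samples used when computing the new Buffer size. Defaults to 20, meaning the previous 20 Buffer size values are taken into account.
taskmanager.network.memory.buffer-debloat.target
The target time within which the receiver should consume the in-flight data. Defaults to 1 s.
taskmanager.network.memory.buffer-debloat.threshold-percentages
The threshold for the relative change between the old and new Buffer size. Defaults to 25 (i.e. 25%). If the change is smaller than this value, no debloating is performed, which avoids the performance jitter that frequent resizing would cause.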
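To make the threshold concrete, here is a minimal sketch of a relative-change check. It is a simplification written for this post: the class name is hypothetical, and Flink's own skipUpdate() may treat edge cases (such as reaching the minimum or maximum size) differently.

```java
/** Simplified relative-change check (illustrative, not the Flink implementation). */
final class DebloatThreshold {
    /** Returns true when the change is below the threshold, i.e. the resize should be skipped. */
    static boolean skipUpdate(int lastBufferSize, int newBufferSize, int thresholdPercent) {
        // Absolute change that corresponds to thresholdPercent of the current size.
        long delta = (long) lastBufferSize * thresholdPercent / 100;
        return Math.abs((long) newBufferSize - lastBufferSize) < delta;
    }

    public static void main(String[] args) {
        // With the default of 25%, a 32 KB Buffer must change by at least 8 KB to be resized.
        System.out.println(skipUpdate(32 * 1024, 30_000, 25));
        System.out.println(skipUpdate(32 * 1024, 20_000, 25));
    }
}
```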

Implementation of Buffer Debloating

Once a StreamTask has started, its invoke() method uses a timer to begin scheduling Buffer Debloating at the period described above.

private void scheduleBufferDebloater() {
    // See https://issues.apache.org/jira/browse/FLINK-23560
    // If there are no input gates, there is no point of calculating the throughput and running
    // the debloater. At the same time, for SourceStreamTask using legacy sources and checkpoint
    // lock, enqueuing even a single mailbox action can cause performance regression. This is
    // especially visible in batch, with disabled checkpointing and no processing time timers.
    if (getEnvironment().getAllInputGates().length == 0
            || !environment
                    .getTaskManagerInfo()
                    .getConfiguration()
                    .getBoolean(TaskManagerOptions.BUFFER_DEBLOAT_ENABLED)) {
        return;
    }
    systemTimerService.registerTimer(
            systemTimerService.getCurrentProcessingTime() + bufferDebloatPeriod,
            timestamp ->
                    mainMailboxExecutor.execute(
                            () -> {
                                debloat();
                                scheduleBufferDebloater();
                            },
                            "Buffer size recalculation"));
}

The debloat() method calls triggerDebloating() on every InputGate of the StreamTask.

void debloat() {
    for (IndexedInputGate inputGate : environment.getAllInputGates()) {
        inputGate.triggerDebloating();
    }
}

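Next comes SingleInputGate#triggerDebloating(), which runs in three steps:

Calculate the throughput;
Recalculate the Buffer size;
If the Buffer size has changed, propagate the new size to the lower layers.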

public void triggerDebloating() {
    if (isFinished() || closeFuture.isDone()) {
        return;
    }
    checkState(bufferDebloater != null, "Buffer debloater should not be null");
    final long currentThroughput = throughputCalculator.calculateThroughput();
    bufferDebloater
            .recalculateBufferSize(currentThroughput, getBuffersInUseCount())
            .ifPresent(this::announceBufferSize);
}

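The throughput calculation is straightforward. It lives in ThroughputCalculator#calculateThroughput() and is simply the amount of data accumulated per unit of time, scaled to bytes per second.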

/** @return Calculated throughput based on the collected data for the last period. */
public long calculateThroughput() {
    if (measurementStartTime != NOT_TRACKED) {
        long absoluteTimeMillis = clock.absoluteTimeMillis();
        currentMeasurementTime += absoluteTimeMillis - measurementStartTime;
        measurementStartTime = absoluteTimeMillis;
    }
    long throughput = calculateThroughput(currentAccumulatedDataSize, currentMeasurementTime);
    currentAccumulatedDataSize = currentMeasurementTime = 0;
    return throughput;
}

public long calculateThroughput(long dataSize, long time) {
    checkArgument(dataSize >= 0, "Size of data should be non negative");
    checkArgument(time >= 0, "Time should be non negative");
    if (time == 0) {
        return currentThroughput;
    }
    return currentThroughput = instantThroughput(dataSize, time);
}

static long instantThroughput(long dataSize, long time) {
    return (long) ((double) dataSize / time * MILLIS_IN_SECOND);
}
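As a worked example of the formula in instantThroughput(), the standalone sketch below repeats the same arithmetic outside Flink. The class name and the input numbers are chosen for illustration only.

```java
/** Standalone copy of the throughput arithmetic, for illustration only. */
final class ThroughputExample {
    static final long MILLIS_IN_SECOND = 1000;

    /** Bytes per second for dataSize bytes observed over timeMillis milliseconds. */
    static long instantThroughput(long dataSize, long timeMillis) {
        return (long) ((double) dataSize / timeMillis * MILLIS_IN_SECOND);
    }

    public static void main(String[] args) {
        // 1,000,000 bytes received during one 200 ms debloat period
        // corresponds to 5,000,000 bytes per second.
        System.out.println(instantThroughput(1_000_000, 200));
    }
}
```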

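The second step first derives an initial target for the total Buffer size from the throughput and the target consumption time (the division by MILLIS_IN_SECOND in the code below suggests that targetTotalBufferSize, despite its name, holds that target time in milliseconds). It then applies an exponential moving average (EMA) to obtain the size of each individual Buffer. The EMA effectively smooths out the impact of traffic spikes.

The BufferSizeEMA class also uses the sample count mentioned earlier; its code is omitted here. Finally, the relative change is checked and the Buffer size is updated.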

public OptionalInt recalculateBufferSize(long currentThroughput, int buffersInUse) {
    int actualBuffersInUse = Math.max(1, buffersInUse);
    long desiredTotalBufferSizeInBytes =
            (currentThroughput * targetTotalBufferSize) / MILLIS_IN_SECOND;
    int newSize =
            bufferSizeEMA.calculateBufferSize(
                    desiredTotalBufferSizeInBytes, actualBuffersInUse);
    lastEstimatedTimeToConsumeBuffers =
            Duration.ofMillis(
                    newSize
                            * actualBuffersInUse
                            * MILLIS_IN_SECOND
                            / Math.max(1, currentThroughput));
    boolean skipUpdate = skipUpdate(newSize);
    
    // Skip update if the new value pretty close to the old one.
    if (skipUpdate) {
        return OptionalInt.empty();
    }
    lastBufferSize = newSize;
    return OptionalInt.of(newSize);
}
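Since the BufferSizeEMA code is omitted above, the following sketch shows the general idea of EMA smoothing applied to a Buffer size. It is not the Flink class: the names are hypothetical, and the smoothing factor 2 / (samples + 1) is a common EMA convention assumed here rather than something stated in this post.

```java
/** Minimal EMA-based buffer sizing sketch (hypothetical, not Flink's BufferSizeEMA). */
final class EmaSketch {
    private final int minBufferSize;
    private final int maxBufferSize;
    private final double alpha; // assumed smoothing factor: 2 / (samples + 1)
    private int lastBufferSize;

    EmaSketch(int minBufferSize, int maxBufferSize, int samples) {
        this.minBufferSize = minBufferSize;
        this.maxBufferSize = maxBufferSize;
        this.alpha = 2.0 / (samples + 1);
        this.lastBufferSize = maxBufferSize; // start from the full segment size
    }

    /** Total in-flight bytes that can be consumed within targetMillis at the given throughput. */
    static long desiredTotalBytes(long throughputBytesPerSecond, long targetMillis) {
        return throughputBytesPerSecond * targetMillis / 1000;
    }

    /** Moves the current size a fraction alpha towards the instantaneous target, then clamps it. */
    int next(long desiredTotalBytes, int buffersInUse) {
        long instant = desiredTotalBytes / Math.max(1, buffersInUse);
        double smoothed = lastBufferSize + alpha * (instant - lastBufferSize);
        long clamped = Math.max(minBufferSize, Math.min(maxBufferSize, Math.round(smoothed)));
        lastBufferSize = (int) clamped;
        return lastBufferSize;
    }

    public static void main(String[] args) {
        EmaSketch ema = new EmaSketch(1024, 32 * 1024, 20);
        long total = desiredTotalBytes(1_000_000, 1000); // 1 MB/s throughput, 1 s target
        for (int i = 0; i < 5; i++) {
            // The size shrinks gradually from 32 KB towards total / 100 = 10,000 bytes.
            System.out.println(ema.next(total, 100));
        }
    }
}
```

Because each call only moves part of the way towards the instantaneous value, a single burst or stall in throughput changes the Buffer size only slightly, while a sustained change is followed over a few dozen periods.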

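Finally, the InputGate propagates the updated Buffer size to each of its InputChannels. Depending on the channel type, it is handled in one of two ways:

For a LocalInputChannel (connected to a local output), the Buffer size of the corresponding ResultSubpartition is updated directly;
For a RemoteInputChannel (connected via Netty to a remote output on another TaskManager), a NewBufferSizeMessage is sent out through the channel.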
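The two cases above amount to a simple dispatch on the channel type. The sketch below is purely illustrative: every type in it is hypothetical and only mimics the shape of the behaviour described, with the remote message modelled as a string in an outbox.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of announcing a new buffer size to local and remote channels (not Flink classes). */
final class AnnounceSketch {
    interface Channel {
        void announceBufferSize(int newSize);
    }

    /** Local case: the producer is in the same process, so its size is updated directly. */
    static final class LocalChannel implements Channel {
        int subpartitionBufferSize = 32 * 1024;

        @Override
        public void announceBufferSize(int newSize) {
            subpartitionBufferSize = newSize;
        }
    }

    /** Remote case: the new size has to travel to the producer as a message. */
    static final class RemoteChannel implements Channel {
        final List<String> outbox = new ArrayList<>();

        @Override
        public void announceBufferSize(int newSize) {
            outbox.add("NewBufferSize(" + newSize + ")");
        }
    }

    /** What the gate does: forward the new size to every channel it contains. */
    static void announce(List<Channel> channels, int newSize) {
        for (Channel channel : channels) {
            channel.announceBufferSize(newSize);
        }
    }
}
```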

The End

Good night and sweet dreams, everyone.
