An Analysis of How Flink Implements Its Consistency Guarantees

Overview

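Flink guarantees consistency through a snapshot mechanism combined with barriers. When a job crashes or is cancelled midway, it can be restored from a checkpoint or a savepoint and the data stream replayed, so that the job's results stay consistent. This guarantee requires exactly-once mode to be enabled. Keep in mind that exactly-once here applies only to state inside Flink. It does not by itself guarantee exactly-once semantics when interacting with external storage; achieving that end to end requires additional handling.

Flink checkpoints support two modes (CheckpointingMode):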

EXACTLY_ONCE
AT_LEAST_ONCE
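
As a minimal sketch of how a job might select one of these modes, assuming the Flink 1.x DataStream API is on the classpath. The class name is hypothetical, and the interval, pause and timeout values are arbitrary placeholders rather than recommendations:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointModeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 5 s (placeholder interval) with aligned barriers.
        env.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);

        // For lower latency at the cost of possible duplicates after recovery, use instead:
        // env.enableCheckpointing(5000L, CheckpointingMode.AT_LEAST_ONCE);

        // Placeholder tuning values, shown only to illustrate where they are set.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000L);
        env.getCheckpointConfig().setCheckpointTimeout(60_000L);

        // ... define sources, transformations and sinks here, then call env.execute(...)
    }
}
```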

EXACTLY ONCE

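In this mode, after a recovery each record is reflected in operator state exactly once. Take a job that counts the elements of a stream: no matter how many times the system crashes or restarts, the count always matches the true number of elements in the stream.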

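EXACTLY_ONCE is not without drawbacks, though. Compared with AT_LEAST_ONCE it is generally slower, because it enables barrier alignment to guarantee consistency, and alignment costs some performance.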

AT LEAST ONCE

In this mode the system snapshots operator state in a simpler way. When it recovers after a crash or cancel, some records may be replayed into operator state more than once.

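In the counting example above, the count after a failure and recovery would be equal to or greater than the true number of elements in the stream. Because no alignment is needed, this mode adds very little latency and processes data faster. It is typically used where low latency is required and duplicate messages can be tolerated.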
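
The difference between the two modes can be sketched with a toy counter in plain Java, with no Flink involved. All names here (ReplayDemo, Snapshot, recoverAndFinish) are hypothetical. The assumption is that a snapshot stores a source offset plus the count at snapshot time, and that recovery replays the log from that offset:

```java
import java.util.List;

class ReplayDemo {
    /** A snapshot pairs the source offset with the operator's count at snapshot time. */
    static final class Snapshot {
        final int offset;
        final long count;
        Snapshot(int offset, long count) { this.offset = offset; this.count = count; }
    }

    /** Restores the count from the snapshot, then replays the log from the saved offset. */
    static long recoverAndFinish(List<String> log, Snapshot s) {
        long count = s.count;
        for (int i = s.offset; i < log.size(); i++) {
            count++; // every replayed record is counted again
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> log = List.of("a", "b", "c", "d", "e", "f");
        // Aligned snapshot: the count covers exactly the records before offset 3.
        Snapshot aligned = new Snapshot(3, 3);
        // Unaligned snapshot: the count already includes one record past offset 3.
        Snapshot unaligned = new Snapshot(3, 4);
        System.out.println(recoverAndFinish(log, aligned));   // 6
        System.out.println(recoverAndFinish(log, unaligned)); // 7
    }
}
```

With the aligned snapshot the recovered count equals the six records in the log. With the unaligned one the record that slipped into the snapshot is counted twice, which is the "equal to or greater than" behaviour of AT_LEAST_ONCE.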

How Consistency Is Implemented

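The overview said that consistency is guaranteed by snapshots and the barrier mechanism. How exactly do they work? It helps to keep the following questions in mind while reading: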

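What does a snapshot store?
When is the system triggered to take a snapshot?
How can a streaming job take snapshots and still keep up its overall processing speed?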

CHECKPOINT

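A snapshot records the current state of every task/operator in the system, and that state reflects the elements processed so far. Snapshots are periodically replaced by newer ones, and outdated ones are deleted. When the system recovers from a crash, it reads the snapshot to restore the state it had before the crash. So how should state (STATE) be understood?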

STATE

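State can be thought of as an intermediate result held by a task/operator at some point in time, for example the data a flatMap has processed so far. State can be recorded and then restored if the system fails. There are two main kinds of state: operator state and keyed state.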

OPERATOR STATE AND KEYED STATE

Operator state is independent of any key and is bound to one specific operator instance. A source or a map operator, for example, can keep its own state by adding state handling to the operator; see the example further down.

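Operator state is exposed mainly through the ListState<T> data structure. During a checkpoint its contents are written to the configured checkpoint storage (HDFS, for example), preserving the operator's state at that moment.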

Keyed State:

State defined on top of a KeyedStream, such as the result of dataStream.keyBy()
That is, the state of operators that run after keyBy, scoped to each key

Data structures available for keyed state:

ValueState<T>
ListState<T>
ReducingState<T>
MapState<UK,UV>
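
To make "scoped to each key" concrete, here is a small plain-Java model of a keyed value state. It is not Flink's implementation: KeyedValueState, setCurrentKey and countPerKey are hypothetical names, and the sketch assumes the runtime selects the current key before each record is processed:

```java
import java.util.HashMap;
import java.util.Map;

class KeyedValueStateDemo {
    /** Hypothetical stand-in for a keyed ValueState: one value slot per key. */
    static final class KeyedValueState<K, V> {
        private final Map<K, V> slots = new HashMap<>();
        private K currentKey;

        void setCurrentKey(K key) { currentKey = key; } // done by the runtime per record
        V value() { return slots.get(currentKey); }     // null if nothing is stored for this key
        void update(V v) { slots.put(currentKey, v); }
        void clear() { slots.remove(currentKey); }
    }

    /** Counts occurrences per key, touching only the current key's slot each time. */
    static Map<String, Integer> countPerKey(String... keys) {
        KeyedValueState<String, Integer> state = new KeyedValueState<>();
        Map<String, Integer> latest = new HashMap<>();
        for (String key : keys) {
            state.setCurrentKey(key);
            Integer old = state.value();
            int next = (old == null) ? 1 : old + 1;
            state.update(next);
            latest.put(key, next);
        }
        return latest;
    }

    public static void main(String[] args) {
        // Key "a" ends at 3 and key "b" at 1; each key only ever sees its own slot.
        System.out.println(countPerKey("a", "b", "a", "a"));
    }
}
```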

CHECKPOINT IMPLEMENTATION EXAMPLES

This is an example that uses operator state:

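import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class BufferingSink implements SinkFunction<Tuple2<String,Integer>>,CheckpointedFunction {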
    private final int threshold;
    private transient ListState<Tuple2<String,Integer>> checkpointedState;
    private List<Tuple2<String,Integer>> bufferedElements;

    public BufferingSink(int threshold) {
        this.threshold = threshold;
        this.bufferedElements = new ArrayList<Tuple2<String, Integer>>();
    }
    @Override
    public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
        bufferedElements.add(value);
        if(bufferedElements.size() >= threshold){
            for(Tuple2<String,Integer> element:bufferedElements){
                //send it to the sink
            }
            bufferedElements.clear();
        }
    }
    @Override
    public void snapshotState(FunctionSnapshotContext functionSnapshotContext) throws Exception {
        /** Called whenever a checkpoint is taken: copy the buffer into operator state. */
        checkpointedState.clear();
        for(Tuple2<String,Integer> element:bufferedElements){
            checkpointedState.add(element);
        }
    }
    /** On initialization, reload the buffer from the saved snapshot to restore the pre-crash state. */
    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<Tuple2<String,Integer>> descriptor = new ListStateDescriptor<Tuple2<String, Integer>>
                ("buffered-elements",TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {
        }));
        checkpointedState = context.getOperatorStateStore().getListState(descriptor);
        if(context.isRestored()){
            for(Tuple2<String,Integer> element:checkpointedState.get()){
                bufferedElements.add(element);
            }
        }
    }
}

This is an example that uses keyed state:

public static class StateMachineMapper extends RichFlatMapFunction<Event, Alert> {

	/** Keyed state holding the state machine's state for the current key. */
	private ValueState<State> currentState;
	@Override
	public void open(Configuration conf) {
		// obtain the keyed state handle; after a restore it is backed by the checkpointed state
		currentState = getRuntimeContext().getState(
					new ValueStateDescriptor<>("state", State.class));
	}
	@Override
	public void flatMap(Event evt, Collector<Alert> out) throws Exception {
		// read the state for the current key, initializing it if none exists
		State state = currentState.value();
		if (state == null) {
			state = State.Initial;
		}
		// ask the state machine which state the given event leads to
		State nextState = state.transition(evt.type());
		if (nextState == State.InvalidTransition) {
			out.collect(new Alert(evt.sourceAddress(), state, evt.type()));
		} else if (nextState.isTerminal()) {
			currentState.clear();
		} else {
			currentState.update(nextState);
		}
	}
}

BARRIER

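Checkpointing by itself needs little explanation, since fault-tolerant systems such as Spark and HDFS rely on similar mechanisms. What makes Flink's consistency guarantee efficient is the barrier, which answers question 2 above (when is a snapshot triggered?). A barrier is a marker that divides the unbounded stream into consecutive segments: records ahead of barrier n belong to snapshot n, and records behind it belong to snapshot n+1. Barriers are injected into the input streams and flow downstream together with the records. They never overtake records, so the stream stays strictly ordered. Once an operator has received the barrier on all of its inputs, it takes its snapshot. Each barrier carries a checkpointId of type long. After an operator finishes its snapshot it acknowledges that checkpointId to the JobManager, and the checkpoint counts as complete only once the JobManager has collected the acknowledgements for that checkpointId from all tasks.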

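Barriers are injected into the parallel data flow at the stream sources. The point where the barriers for snapshot n are injected (call it Sn) is the position in the source stream up to which the snapshot covers the data. In Kafka, for example, this position is the offset of the last record in the partition. Sn is reported to the checkpoint coordinator. The barriers then flow downstream. When an intermediate operator has received the barrier for snapshot n from all of its input streams, it emits a barrier for snapshot n into all of its output streams. Once a sink operator (the end of a streaming DAG) has received barrier n from all of its input streams, it acknowledges snapshot n to the checkpoint coordinator. After all sinks have acknowledged a snapshot, it is considered complete. Once snapshot n is complete, the job will never again ask the source for records before Sn, because by then those records (and the records derived from them) have passed through the entire data flow topology and are fully processed.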
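
The injection step can be sketched in plain Java. This is an illustration only: BarrierInjectionDemo and its members are hypothetical names, barriers are modelled as strings, and a barrier is injected after a fixed number of records, whereas Flink triggers checkpoints on a timer:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BarrierInjectionDemo {
    /** Offsets reported to the (simulated) coordinator: checkpoint id -> Sn. */
    static final Map<Long, Integer> reportedOffsets = new LinkedHashMap<>();

    /**
     * Emits the records of one partition and injects a barrier after every
     * `every` records. Barriers are modelled as the strings "BARRIER-n".
     */
    static List<String> emitWithBarriers(List<String> partition, int every) {
        reportedOffsets.clear();
        List<String> out = new ArrayList<>();
        long checkpointId = 0;
        for (int offset = 0; offset < partition.size(); offset++) {
            out.add(partition.get(offset));
            if ((offset + 1) % every == 0) {
                checkpointId++;
                // Sn: offset of the last record covered by snapshot n
                reportedOffsets.put(checkpointId, offset);
                out.add("BARRIER-" + checkpointId);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(emitWithBarriers(List.of("r0", "r1", "r2", "r3", "r4"), 2));
        // [r0, r1, BARRIER-1, r2, r3, BARRIER-2, r4]
        System.out.println(reportedOffsets); // {1=1, 2=3}
    }
}
```

The barrier always follows the last record it covers, so a recovery from snapshot 2 in this sketch would resume reading right after offset 3.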

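Operators that receive more than one input stream need to align their inputs on the snapshot barriers. The original post illustrates this with a figure; the steps are: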

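As soon as the operator receives snapshot barrier n from one input stream, it cannot process any further records from that stream until it has received barrier n from the other inputs as well. Otherwise it would mix records that belong to snapshot n with records that belong to snapshot n+1.
Streams that have already delivered barrier n are temporarily set aside. Records received from them are not processed but placed into an input buffer.
Once barrier n arrives from the last stream, the operator emits all pending outgoing records and then emits the barriers for snapshot n itself.
After that, it resumes processing records from all input streams, handling the records in the input buffer before any new records from the streams.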
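
The alignment steps above can be modelled for a two-input summing operator in plain Java. This is a sketch, not Flink's code: AlignmentDemo and Element are hypothetical names, only a single checkpoint is modelled, and records are plain integers:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class AlignmentDemo {
    /** One arriving element: a record value, or a barrier when isBarrier is true. */
    static final class Element {
        final int channel;
        final int value;
        final boolean isBarrier;
        Element(int channel, int value, boolean isBarrier) {
            this.channel = channel; this.value = value; this.isBarrier = isBarrier;
        }
        static Element record(int ch, int v) { return new Element(ch, v, false); }
        static Element barrier(int ch) { return new Element(ch, 0, true); }
    }

    /** Sums records from all channels; returns {sum at the aligned snapshot, final sum}. */
    static int[] run(List<Element> arrivals, int numChannels) {
        boolean[] blocked = new boolean[numChannels];
        Deque<Element> buffered = new ArrayDeque<>();
        int barriersSeen = 0;
        int sum = 0;
        int snapshot = -1;
        for (Element e : arrivals) {
            if (blocked[e.channel]) {
                buffered.add(e);             // channel already delivered its barrier: park the record
            } else if (!e.isBarrier) {
                sum += e.value;              // pre-barrier record: process normally
            } else {
                blocked[e.channel] = true;   // barrier arrived: block this channel
                if (++barriersSeen == numChannels) {
                    snapshot = sum;          // all barriers in: the state is consistent here
                    barriersSeen = 0;
                    Arrays.fill(blocked, false);
                    while (!buffered.isEmpty()) {
                        sum += buffered.poll().value; // parked records are processed first
                    }
                }
            }
        }
        return new int[] {snapshot, sum};
    }

    public static void main(String[] args) {
        List<Element> arrivals = List.of(
            Element.record(0, 1), Element.barrier(0), Element.record(0, 10),
            Element.record(1, 2), Element.barrier(1), Element.record(1, 20));
        int[] r = run(arrivals, 2);
        System.out.println("snapshot=" + r[0] + " final=" + r[1]); // snapshot=3 final=33
    }
}
```

The record 10 arrives on channel 0 after that channel's barrier, so it is parked and stays out of the snapshot (3 = 1 + 2), yet it is still processed afterwards.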

With barriers explained, here is the checkpointing process itself (shown as a figure in the original post):

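Operators snapshot their state at the moment they have received all snapshot barriers from their input streams, and before they emit the barriers to their output streams. At that point, every state update caused by records ahead of the barriers has been applied, and no update that depends on records behind the barriers has been applied. Because the state in a snapshot may be large, it is stored in a configurable state backend. By default this is the JobManager's memory, but for production use a distributed, reliable store (such as HDFS) should be configured. After the state has been stored, the operator acknowledges the checkpoint, emits the snapshot barrier into its output streams, and carries on.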

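The resulting snapshot now contains:

For each parallel stream data source, the offset/position in the stream at the time the snapshot was started

For each operator, a pointer to the state that was stored as part of the snapshot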

BARRIER CORE CODE WALKTHROUGH

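As noted above, the barrier is central to Flink's consistency guarantee. This section walks through BarrierBuffer, the class that handles barriers for the EXACTLY_ONCE guarantee. It blocks an input once that input has delivered a barrier, until every input has received the barrier for the same checkpoint; this is the alignment described earlier. To avoid back-pressuring the input streams (which could cause a distributed deadlock), BarrierBuffer keeps receiving buffers from the blocked channels and stores them internally until the blocks are released.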

CheckpointCoordinator

Before turning to BarrierBuffer, it helps to see when a checkpoint is triggered. The startCheckpointScheduler() method of CheckpointCoordinator, the checkpoint coordinator, schedules a timer task that triggers checkpoints periodically.

public void startCheckpointScheduler() {
	synchronized (lock) {
		if (shutdown) {
			throw new IllegalArgumentException("Checkpoint coordinator is shut down");
		}

		// make sure all prior timers are cancelled
		stopCheckpointScheduler();

		periodicScheduling = true;
		long initialDelay = ThreadLocalRandom.current().nextLong(
			minPauseBetweenCheckpointsNanos / 1_000_000L, baseInterval + 1L);
		// start the periodic trigger, firing every baseInterval milliseconds
		currentPeriodicTrigger = timer.scheduleAtFixedRate(
				new ScheduledTrigger(), initialDelay, baseInterval, TimeUnit.MILLISECONDS);
	}
}

private final class ScheduledTrigger implements Runnable {

	@Override
	public void run() {
		try {
			// trigger a checkpoint
			triggerCheckpoint(System.currentTimeMillis(), true);
		}
		catch (Exception e) {
			LOG.error("Exception while triggering checkpoint for job {}.", job, e);
		}
	}
}

// inside triggerCheckpoint, triggerCheckpoint is called on every Execution that takes part in the checkpoint
// send the messages to the tasks that trigger their checkpoint
for (Execution execution: executions) {
	execution.triggerCheckpoint(checkpointID, timestamp, checkpointOptions);
}
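
The scheduling pattern in the excerpt above (a random initial delay followed by a fixed-rate trigger) can be tried out with the JDK alone. This is a sketch under stated assumptions: PeriodicTriggerDemo and runTriggers are hypothetical names, and incrementing a counter stands in for the real triggerCheckpoint call:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class PeriodicTriggerDemo {
    /** Fires `triggers` simulated checkpoints and returns the last checkpoint id issued. */
    static long runTriggers(int triggers, long baseIntervalMs, long minPauseMs) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicLong checkpointId = new AtomicLong();
        CountDownLatch done = new CountDownLatch(triggers);
        // Random initial delay in [minPauseMs, baseIntervalMs], mirroring the excerpt.
        long initialDelay = ThreadLocalRandom.current().nextLong(minPauseMs, baseIntervalMs + 1L);
        ScheduledFuture<?> periodic = timer.scheduleAtFixedRate(() -> {
            if (done.getCount() > 0) {
                checkpointId.incrementAndGet(); // stand-in for triggerCheckpoint(...)
                done.countDown();
            }
        }, initialDelay, baseIntervalMs, TimeUnit.MILLISECONDS);
        try {
            done.await();                       // wait until all triggers have fired
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            periodic.cancel(false);
            timer.shutdownNow();
        }
        return checkpointId.get();
    }

    public static void main(String[] args) {
        System.out.println(runTriggers(3, 20L, 5L)); // 3
    }
}
```

The random initial delay spreads out the first trigger, which is presumably why the coordinator draws it between the minimum pause and the base interval.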

BarrierBuffer

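With the trigger path covered, back to the BarrierBuffer class. It has several core methods, explained one by one below.

getNextNonBlocked

getNextNonBlocked returns the next (non-blocked) record for the operator to process. It blocks the calling context, through several mechanisms, until the next non-blocked record is available.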

@Override
public BufferOrEvent getNextNonBlocked() throws Exception {
	while (true) {
		// fetch the next buffer or barrier event
		// process buffered BufferOrEvents before grabbing new ones
		Optional<BufferOrEvent> next;
		// nothing is currently buffered, so read from the input gate
		if (currentBuffered == null) {
			next = inputGate.getNextBufferOrEvent();
		}
		// otherwise read from the current buffered sequence
		else {
			next = Optional.ofNullable(currentBuffered.getNext());
			// an empty result means the buffered sequence is exhausted
			if (!next.isPresent()) {
				// clean up the current sequence and open the next queued one, if any
				completeBufferedSequence();
				// recurse to fetch the next element
				return getNextNonBlocked();
			}
		}
		// nothing was returned: the input is exhausted
		if (!next.isPresent()) {
			if (!endOfStream) {
				// end of input stream. stream continues with the buffered data
				endOfStream = true;
				releaseBlocksAndResetBarriers();
				return getNextNonBlocked();
			} else {
				// final end of both input and buffered data
				return null;
			}
		}
		BufferOrEvent bufferOrEvent = next.get();
		// elements from a blocked channel are parked until the alignment ends
		if (isBlocked(bufferOrEvent.getChannelIndex())) {
			// if the channel is blocked, we just store the BufferOrEvent
			bufferBlocker.add(bufferOrEvent);
			checkSizeLimit();
		}
		// an ordinary data buffer (not an event) is returned directly
		else if (bufferOrEvent.isBuffer()) {
			return bufferOrEvent;
		}
		// a checkpoint barrier
		else if (bufferOrEvent.getEvent().getClass() == CheckpointBarrier.class) {
			if (!endOfStream) {
				// process barriers only if there is a chance of the checkpoint completing
				processBarrier((CheckpointBarrier) bufferOrEvent.getEvent(), bufferOrEvent.getChannelIndex());
			}
		}
		// a cancellation marker: abort any alignment in progress for that checkpoint and resume regular processing
		else if (bufferOrEvent.getEvent().getClass() == CancelCheckpointMarker.class) {
			processCancellationBarrier((CancelCheckpointMarker) bufferOrEvent.getEvent());
		} else {
			// an EndOfPartitionEvent means this partition has reached its end
			if (bufferOrEvent.getEvent().getClass() == EndOfPartitionEvent.class) {
				processEndOfPartition();
			}
			return bufferOrEvent;
		}
	}
}

private void processEndOfPartition() throws Exception {
	// one more channel is closed
	numClosedChannels++;
	// an alignment in progress cannot finish once a channel has closed
	if (numBarriersReceived > 0) {
		// let the task know we skip a checkpoint
		notifyAbort(currentCheckpointId, new InputEndOfStreamException());

		// no chance to complete this checkpoint
		releaseBlocksAndResetBarriers();
	}
}

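releaseBlocksAndResetBarriers() is called when an alignment ends, whether the checkpoint completed or was aborted. It releases the blocks on all channels and resets the barrier count, so that the next alignment starts from a clean state.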

/**
 * Releases the blocks on all channels and resets the barrier count.
 * Makes sure the just written data is the next to be consumed.
 */
private void releaseBlocksAndResetBarriers() throws IOException {
	LOG.debug("{}: End of stream alignment, feeding buffered data back.",
		inputGate.getOwningTaskName());

	for (int i = 0; i < blockedChannels.length; i++) {
		// clear the blocked flag on every channel
		blockedChannels[i] = false;
	}
	// no buffered sequence is currently being replayed
	if (currentBuffered == null) {
		// common case: no more buffered data
		// roll the blocker over into a new readable sequence
		currentBuffered = bufferBlocker.rollOverReusingResources();
		// open the sequence for reading
		if (currentBuffered != null) {
			currentBuffered.open();
		}
	}
	else {
		// uncommon case: buffered data pending
		// push back the pending data, if we have any
		LOG.debug("{}: Checkpoint skipped via buffered data:" +
				"Pushing back current alignment buffers and feeding back new alignment data first.",
			inputGate.getOwningTaskName());

		// since we did not fully drain the previous sequence, we need to allocate a new buffer for this one
		BufferOrEventSequence bufferedNow = bufferBlocker.rollOverWithoutReusingResources();
		if (bufferedNow != null) {
			// open the new sequence for reading
			bufferedNow.open();
			// queue the sequence that has not been fully consumed yet
			queuedBuffered.addFirst(currentBuffered);
			numQueuedBytes += currentBuffered.size();
			// the new sequence becomes the current one
			currentBuffered = bufferedNow;
		}
	}

	if (LOG.isDebugEnabled()) {
		LOG.debug("{}: Size of buffered data: {} bytes",
			inputGate.getOwningTaskName(),
			currentBuffered == null ? 0L : currentBuffered.size());
	}

	// the next barrier that comes must assume it is the first
	// reset the count of received barriers to 0
	numBarriersReceived = 0;

	if (startOfAlignmentTimestamp > 0) {
		latestAlignmentDurationNanos = System.nanoTime() - startOfAlignmentTimestamp;
		startOfAlignmentTimestamp = 0;
	}
}

Another important method is processBarrier(), which defines what happens when a barrier event is received.

private void processBarrier(CheckpointBarrier receivedBarrier, int channelIndex) throws Exception {
	final long barrierId = receivedBarrier.getId();
	// fast path for a single input channel
	if (totalNumberOfInputChannels == 1) {
		if (barrierId > currentCheckpointId) {
			// new checkpoint
			currentCheckpointId = barrierId;
			notifyCheckpoint(receivedBarrier);
		}
		return;
	}
	// -- general code path for multiple input channels --
	// numBarriersReceived > 0 means an alignment for some checkpoint is already in progress
	if (numBarriersReceived > 0) {
		// this is only true if some alignment is already progress and was not canceled
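		// a further barrier that belongs to the checkpoint currently being aligned
		if (barrierId == currentCheckpointId) {
			// regular case: handle the barrier
			onBarrier(channelIndex);
		}
		// the barrier ID is greater than the current checkpoint ID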
		else if (barrierId > currentCheckpointId) {
			// we did not complete the current checkpoint, another started before
			LOG.warn("{}: Received checkpoint barrier for checkpoint {} before completing current checkpoint {}. " +
					"Skipping current checkpoint.",
				inputGate.getOwningTaskName(),
				barrierId,
				currentCheckpointId);

			// let the task know we are not completing this
			notifyAbort(currentCheckpointId, new CheckpointDeclineSubsumedException(barrierId));

			// abort the current checkpoint: it can no longer complete, so release the blocks
			releaseBlocksAndResetBarriers();

			// begin a the new checkpoint
			beginNewAlignment(barrierId, channelIndex);
		}
		else {
			// ignore trailing barrier from an earlier checkpoint (obsolete now)
			return;
		}
	}
	else if (barrierId > currentCheckpointId) {
		// this is the first barrier of a new checkpoint
		beginNewAlignment(barrierId, channelIndex);
	}
	else {
		// either the current checkpoint was canceled (numBarriers == 0) or
		// this barrier is from an old subsumed checkpoint
		return;
	}
	// check if we have all barriers - since canceled checkpoints always have zero barriers
	// this can only happen on a non canceled checkpoint
	if (numBarriersReceived + numClosedChannels == totalNumberOfInputChannels) {
		// actually trigger checkpoint
		if (LOG.isDebugEnabled()) {
			LOG.debug("{}: Received all barriers, triggering checkpoint {} at {}.",
				inputGate.getOwningTaskName(),
				receivedBarrier.getId(),
				receivedBarrier.getTimestamp());
		}
		releaseBlocksAndResetBarriers();
		notifyCheckpoint(receivedBarrier);
	}
}
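
The ID comparisons in processBarrier() are easier to follow in isolation. The plain-Java model below keeps only that decision logic and leaves out channel blocking, closed channels and the single-channel fast path. BarrierDecisionDemo and its log are hypothetical names used for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class BarrierDecisionDemo {
    private final int totalChannels;
    private long currentCheckpointId = -1L;
    private int numBarriersReceived = 0;
    /** Records the decisions taken, in order. */
    final List<String> log = new ArrayList<>();

    BarrierDecisionDemo(int totalChannels) { this.totalChannels = totalChannels; }

    void onBarrier(long barrierId) {
        if (numBarriersReceived > 0 && barrierId == currentCheckpointId) {
            numBarriersReceived++;                       // another barrier of the ongoing alignment
        } else if (barrierId > currentCheckpointId) {
            if (numBarriersReceived > 0) {
                log.add("abort " + currentCheckpointId); // subsumed by a newer checkpoint
            }
            currentCheckpointId = barrierId;             // begin a new alignment
            numBarriersReceived = 1;
        } else {
            return;                                      // stale barrier from an older checkpoint
        }
        if (numBarriersReceived == totalChannels) {
            log.add("checkpoint " + currentCheckpointId);
            numBarriersReceived = 0;                     // the next barrier must start a new alignment
        }
    }

    public static void main(String[] args) {
        BarrierDecisionDemo d = new BarrierDecisionDemo(3);
        for (long id : new long[] {1, 1, 2, 2, 1, 2}) {
            d.onBarrier(id);
        }
        System.out.println(d.log); // [abort 1, checkpoint 2]
    }
}
```

In the run shown, checkpoint 1 collects two of three barriers before a barrier for checkpoint 2 arrives. It is aborted, the late barrier for checkpoint 1 is ignored, and checkpoint 2 completes once its third barrier is in.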

BarrierTracker

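In AT_LEAST_ONCE mode, the getNextNonBlocked() method of the BarrierTracker class is used instead. As the code shows, barriers are not aligned: the method simply keeps pulling from the inputGate via getNextBufferOrEvent().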

@Override
public BufferOrEvent getNextNonBlocked() throws Exception {
	while (true) {
		Optional<BufferOrEvent> next = inputGate.getNextBufferOrEvent();
		if (!next.isPresent()) {
			// buffer or input exhausted
			return null;
		}

		BufferOrEvent bufferOrEvent = next.get();
		if (bufferOrEvent.isBuffer()) {
			return bufferOrEvent;
		}
		else if (bufferOrEvent.getEvent().getClass() == CheckpointBarrier.class) {
			processBarrier((CheckpointBarrier) bufferOrEvent.getEvent(), bufferOrEvent.getChannelIndex());
		}
		else if (bufferOrEvent.getEvent().getClass() == CancelCheckpointMarker.class) {
			processCheckpointAbortBarrier((CancelCheckpointMarker) bufferOrEvent.getEvent(), bufferOrEvent.getChannelIndex());
		}
		else {
			// some other event
			return bufferOrEvent;
		}
	}
}
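
What tracking without alignment means for the snapshot can be shown with a plain-Java model of a two-input summing operator. This is a sketch, not Flink's BarrierTracker: the class name and the BARRIER marker are hypothetical, and only a single checkpoint is modelled:

```java
class BarrierTrackingDemo {
    /** Marker value standing in for a barrier event. */
    static final int BARRIER = Integer.MIN_VALUE;

    /**
     * Each arrival is {channel, value}. Nothing is ever blocked or buffered.
     * Returns {sum captured when the last barrier arrived, final sum}.
     */
    static int[] run(int[][] arrivals, int numChannels) {
        int barriersSeen = 0;
        int sum = 0;
        int snapshot = -1;
        for (int[] e : arrivals) {
            int value = e[1];
            if (value != BARRIER) {
                sum += value;    // records are always processed immediately
            } else if (++barriersSeen == numChannels) {
                snapshot = sum;  // may include records that followed an earlier barrier
                barriersSeen = 0;
            }
        }
        return new int[] {snapshot, sum};
    }

    public static void main(String[] args) {
        int[][] arrivals = {
            {0, 1}, {0, BARRIER}, {0, 10}, {1, 2}, {1, BARRIER}, {1, 20}
        };
        int[] r = run(arrivals, 2);
        System.out.println(r[0] + " " + r[1]); // 13 33
    }
}
```

The record 10 arrives on channel 0 after that channel's barrier but before the barrier on channel 1, so it ends up in the snapshot (13). After a recovery the source would replay it, and it would be counted a second time, which is the duplicate that AT_LEAST_ONCE permits.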
