Next we look at how a container is allocated and then launched on an NM node, continuing from the previous article:
FairScheduler.handle() -> FairScheduler.nodeUpdate() -> FairScheduler.attemptScheduling() -> queueMgr.getRootQueue().assignContainer():
public Resource assignContainer(FSSchedulerNode node) {
Resource assigned = Resources.none();
// If this queue is over its limit, reject
if (!assignContainerPreCheck(node)) {
return assigned;
}
// Sort the child queues according to the FairScheduler scheduling policy
writeLock.lock();
try {
Collections.sort(childQueues, policy.getComparator());
} finally {
writeLock.unlock();
}
readLock.lock();
try {
for (FSQueue child : childQueues) {
assigned = child.assignContainer(node);
if (!Resources.equals(assigned, Resources.none())) {
break;
}
}
} finally {
readLock.unlock();
}
return assigned;
}
FairScheduler.handle() -> FairScheduler.nodeUpdate() -> FairScheduler.attemptScheduling() -> queueMgr.getRootQueue().assignContainer() -> FSLeafQueue.assignContainer()
public Resource assignContainer(FSSchedulerNode node) {
Resource assigned = none();
if (LOG.isDebugEnabled()) {
LOG.debug("Node " + node.getNodeName() + " offered to queue: " +
getName() + " fairShare: " + getFairShare());
}
if (!assignContainerPreCheck(node)) {
return assigned;
}
for (FSAppAttempt sched : fetchAppsWithDemand(true)) {
if (SchedulerAppUtils.isPlaceBlacklisted(sched, node, LOG)) {
continue;
}
assigned = sched.assignContainer(node);
if (!assigned.equals(none())) {
if (LOG.isDebugEnabled()) {
LOG.debug("Assigned container in queue:" + getName() + " " +
"container:" + assigned);
}
break;
}
}
return assigned;
}
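Allocation descends from the root queue to its child queues and finally, inside a leaf queue, to the application attempt that the scheduling policy selects to receive the container.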
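The descent described above can be sketched in isolation. This is a minimal illustration, not the YARN classes: `SketchQueue`, `Leaf`, and `Parent` are hypothetical stand-ins for FSQueue, FSLeafQueue, and the parent queue, resources are plain ints, and the "least used first" comparator is only an assumed placeholder for the real policy comparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class QueueDescentSketch {
    // Hypothetical stand-in for FSQueue; usage and sizes are plain ints.
    interface SketchQueue {
        int usage();
        int assignContainer(int freeOnNode); // 0 plays the role of Resources.none()
    }

    // Leaf queue with at most one pending request of size 'demand'.
    static final class Leaf implements SketchQueue {
        private int usage;
        private int demand;

        Leaf(int usage, int demand) {
            this.usage = usage;
            this.demand = demand;
        }

        public int usage() {
            return usage;
        }

        public int assignContainer(int freeOnNode) {
            if (demand == 0 || demand > freeOnNode) {
                return 0; // nothing to ask for, or the request does not fit
            }
            int assigned = demand;
            usage += assigned;
            demand = 0;
            return assigned;
        }
    }

    // Parent queue: sort the children by policy, stop at the first child that assigns.
    static final class Parent implements SketchQueue {
        private final List<SketchQueue> children = new ArrayList<>();
        private final Comparator<SketchQueue> policy;

        Parent(Comparator<SketchQueue> policy) {
            this.policy = policy;
        }

        void add(SketchQueue child) {
            children.add(child);
        }

        public int usage() {
            int total = 0;
            for (SketchQueue child : children) {
                total += child.usage();
            }
            return total;
        }

        public int assignContainer(int freeOnNode) {
            children.sort(policy);
            for (SketchQueue child : children) {
                int assigned = child.assignContainer(freeOnNode);
                if (assigned != 0) {
                    return assigned;
                }
            }
            return 0;
        }
    }

    static String run() {
        // Assumed policy: the queue with the lowest usage is offered the node first.
        Parent root = new Parent(Comparator.comparingInt(SketchQueue::usage));
        Leaf a = new Leaf(8, 2);
        Leaf b = new Leaf(4, 0); // least used, but has no demand
        Leaf c = new Leaf(6, 3);
        root.add(a);
        root.add(b);
        root.add(c);
        int assigned = root.assignContainer(4);
        return assigned + " a=" + a.usage() + " b=" + b.usage() + " c=" + c.usage();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "3 a=8 b=4 c=9"
    }
}
```

Queue b is offered the node first but has nothing to ask for, so the offer falls through to c, which takes the container and ends the loop, much as the `break` does in the excerpts above.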
FairScheduler.handle() -> FairScheduler.nodeUpdate() -> FairScheduler.attemptScheduling() -> queueMgr.getRootQueue().assignContainer() -> FSLeafQueue.assignContainer() -> FSAppAttempt.assignContainer():
public Resource assignContainer(FSSchedulerNode node) {
if (isOverAMShareLimit()) {
PendingAsk amAsk = appSchedulingInfo.getNextPendingAsk();
updateAMDiagnosticMsg(amAsk.getPerAllocationResource(),
" exceeds maximum AM resource allowed).");
if (LOG.isDebugEnabled()) {
LOG.debug("AM resource request: " + amAsk.getPerAllocationResource()
+ " exceeds maximum AM resource allowed, "
+ getQueue().dumpState());
}
return Resources.none();
}
return assignContainer(node, false);
}
private Resource assignContainer(FSSchedulerNode node, boolean reserved) {
if (LOG.isTraceEnabled()) {
LOG.trace("Node offered to app: " + getName() + " reserved: " + reserved);
}
Collection<SchedulerRequestKey> keysToTry = (reserved) ?
Collections.singletonList(
node.getReservedContainer().getReservedSchedulerKey()) :
getSchedulerKeys();
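// For each priority of request in the application, check whether it can be satisfied at the NODE_LOCAL, RACK_LOCAL, or OFF_SWITCH locality level
// A request may be delayed if the delay scheduling parameters are set to improve locality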
try {
writeLock.lock();
for (SchedulerRequestKey schedulerKey : keysToTry) {
if (!reserved && !hasContainerForNode(schedulerKey, node)) {
continue;
}
addSchedulingOpportunity(schedulerKey);
PendingAsk rackLocalPendingAsk = getPendingAsk(schedulerKey,
node.getRackName());
PendingAsk nodeLocalPendingAsk = getPendingAsk(schedulerKey,
node.getNodeName());
if (nodeLocalPendingAsk.getCount() > 0
&& !appSchedulingInfo.canDelayTo(schedulerKey,
node.getNodeName())) {
LOG.warn("Relax locality off is not supported on local request: "
+ nodeLocalPendingAsk);
}
NodeType allowedLocality;
if (scheduler.isContinuousSchedulingEnabled()) {
allowedLocality = getAllowedLocalityLevelByTime(schedulerKey,
scheduler.getNodeLocalityDelayMs(),
scheduler.getRackLocalityDelayMs(),
scheduler.getClock().getTime());
} else {
allowedLocality = getAllowedLocalityLevel(schedulerKey,
scheduler.getNumClusterNodes(),
scheduler.getNodeLocalityThreshold(),
scheduler.getRackLocalityThreshold());
}
if (rackLocalPendingAsk.getCount() > 0
&& nodeLocalPendingAsk.getCount() > 0) {
if (LOG.isTraceEnabled()) {
LOG.trace("Assign container on " + node.getNodeName()
+ " node, assignType: NODE_LOCAL" + ", allowedLocality: "
+ allowedLocality + ", priority: " + schedulerKey.getPriority()
+ ", app attempt id: " + this.attemptId);
}
return assignContainer(node, nodeLocalPendingAsk, NodeType.NODE_LOCAL,
reserved, schedulerKey);
}
if (!appSchedulingInfo.canDelayTo(schedulerKey, node.getRackName())) {
continue;
}
if (rackLocalPendingAsk.getCount() > 0
&& (allowedLocality.equals(NodeType.RACK_LOCAL) || allowedLocality
.equals(NodeType.OFF_SWITCH))) {
if (LOG.isTraceEnabled()) {
LOG.trace("Assign container on " + node.getNodeName()
+ " node, assignType: RACK_LOCAL" + ", allowedLocality: "
+ allowedLocality + ", priority: " + schedulerKey.getPriority()
+ ", app attempt id: " + this.attemptId);
}
return assignContainer(node, rackLocalPendingAsk, NodeType.RACK_LOCAL,
reserved, schedulerKey);
}
PendingAsk offswitchAsk = getPendingAsk(schedulerKey,
ResourceRequest.ANY);
if (!appSchedulingInfo.canDelayTo(schedulerKey, ResourceRequest.ANY)) {
continue;
}
if (offswitchAsk.getCount() > 0) {
if (getAppPlacementAllocator(schedulerKey).getUniqueLocationAsks()
<= 1 || allowedLocality.equals(NodeType.OFF_SWITCH)) {
if (LOG.isTraceEnabled()) {
LOG.trace("Assign container on " + node.getNodeName()
+ " node, assignType: OFF_SWITCH" + ", allowedLocality: "
+ allowedLocality + ", priority: "
+ schedulerKey.getPriority()
+ ", app attempt id: " + this.attemptId);
}
return assignContainer(node, offswitchAsk, NodeType.OFF_SWITCH,
reserved, schedulerKey);
}
}
if (LOG.isTraceEnabled()) {
LOG.trace("Can't assign container on " + node.getNodeName()
+ " node, allowedLocality: " + allowedLocality + ", priority: "
+ schedulerKey.getPriority() + ", app attempt id: "
+ this.attemptId);
}
}
} finally {
writeLock.unlock();
}
return Resources.none();
}
For an analysis of how delay scheduling improves locality, see the article "Hadoop Yarn Delay Scheduling Analysis".
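To make the branch order in the method above easier to follow, here is a minimal sketch of the locality decision on its own. It is a simplification under stated assumptions: `choose` and `relax` are hypothetical helpers, the `canDelayTo` checks are left out, and `relax` only mirrors the general idea of loosening the allowed level once a request has waited longer than the configured delay.

```java
final class LocalityDecisionSketch {
    enum NodeType { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    /**
     * Returns the assignment type for this node, or null when the request
     * should skip the node for now. Arguments are pending ask counts.
     */
    static NodeType choose(int nodeLocalAsks, int rackLocalAsks, int offSwitchAsks,
                           int uniqueLocationAsks, NodeType allowed) {
        if (rackLocalAsks > 0 && nodeLocalAsks > 0) {
            return NodeType.NODE_LOCAL; // best case, always allowed
        }
        if (rackLocalAsks > 0
                && (allowed == NodeType.RACK_LOCAL || allowed == NodeType.OFF_SWITCH)) {
            return NodeType.RACK_LOCAL;
        }
        if (offSwitchAsks > 0
                && (uniqueLocationAsks <= 1 || allowed == NodeType.OFF_SWITCH)) {
            return NodeType.OFF_SWITCH;
        }
        return null;
    }

    /** Assumed time-based relaxation: loosen one level once the wait exceeds the delay. */
    static NodeType relax(NodeType current, long waitedMs, long nodeDelayMs, long rackDelayMs) {
        if (current == NodeType.NODE_LOCAL && waitedMs > nodeDelayMs) {
            return NodeType.RACK_LOCAL;
        }
        if (current == NodeType.RACK_LOCAL && waitedMs > rackDelayMs) {
            return NodeType.OFF_SWITCH;
        }
        return current;
    }

    public static void main(String[] args) {
        // A rack-local ask on a node that is still restricted to NODE_LOCAL is skipped...
        System.out.println(choose(0, 1, 1, 2, NodeType.NODE_LOCAL));
        // ...until the allowed level has been relaxed.
        NodeType relaxed = relax(NodeType.NODE_LOCAL, 6000, 5000, 5000);
        System.out.println(choose(0, 1, 1, 2, relaxed));
    }
}
```

The point of the delay is visible in the two calls in `main`: the same request is refused while only NODE_LOCAL is allowed and accepted as RACK_LOCAL after the wait has passed the threshold.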
Depending on the locality level, the following overload is called:
private Resource assignContainer(
FSSchedulerNode node, PendingAsk pendingAsk, NodeType type,
boolean reserved, SchedulerRequestKey schedulerKey) {
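// Amount of resources this request needs
Resource capability = pendingAsk.getPerAllocationResource();
// Amount of resources on the node that can still be allocated
Resource available = node.getUnallocatedResource();
// The reserved container, if this is a reserved allocation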
Container reservedContainer = null;
if (reserved) {
reservedContainer = node.getReservedContainer().getContainer();
}
// If the requested resources fit within what the node can still allocate
if (Resources.fitsIn(capability, available)) {
// Notify the application of the newly allocated container
RMContainer allocatedContainer =
allocate(type, node, schedulerKey, pendingAsk,
reservedContainer);
if (allocatedContainer == null) {
if (reserved) {
unreserve(schedulerKey, node);
}
return Resources.none();
}
// If this was a reserved allocation, cancel the reservation
if (reserved) {
unreserve(schedulerKey, node);
}
// Notify the node of the newly allocated container
node.allocateContainer(allocatedContainer);
if (!isAmRunning() && !getUnmanagedAM()) {
setAMResource(capability);
getQueue().addAMResourceUsage(capability);
setAmRunning(true);
}
return capability;
}
if (LOG.isDebugEnabled()) {
LOG.debug("Resource request: " + capability + " exceeds the available"
+ " resources of the node.");
}
// ... (the rest of the method, which handles a request that does not fit, is omitted here)
}
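The fit check that gates the allocation can be illustrated with a small sketch. It assumes a two-dimensional resource (memory and vcores) and a hypothetical `Res` type; YARN's own Resource may carry additional resource types, so this is only an approximation of what a component-wise comparison looks like.

```java
final class ResourceFitSketch {
    // Hypothetical two-dimensional resource used only for this illustration.
    static final class Res {
        final long memoryMb;
        final int vcores;

        Res(long memoryMb, int vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }
    }

    /** True only if every component of 'smaller' is at most the same component of 'bigger'. */
    static boolean fitsIn(Res smaller, Res bigger) {
        return smaller.memoryMb <= bigger.memoryMb && smaller.vcores <= bigger.vcores;
    }

    public static void main(String[] args) {
        Res available = new Res(4096, 4);
        System.out.println(fitsIn(new Res(1024, 1), available)); // fits on both dimensions
        System.out.println(fitsIn(new Res(1024, 8), available)); // too many vcores
    }
}
```

A request that is small in memory but too large in vcores still fails the check, which is why a node with plenty of free memory can still be passed over.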
How the application is notified of the newly allocated container is what we focus on:
FairScheduler.handle() -> FairScheduler.nodeUpdate() -> FairScheduler.attemptScheduling() -> queueMgr.getRootQueue().assignContainer() -> FSLeafQueue.assignContainer() -> FSAppAttempt.assignContainer() -> FSAppAttempt.allocate():
public RMContainer allocate(NodeType type, FSSchedulerNode node,
SchedulerRequestKey schedulerKey, PendingAsk pendingAsk,
Container reservedContainer) {
RMContainer rmContainer;
Container container;
try {
writeLock.lock();
// Update allowedLocalityLevel according to the locality level that was actually scheduled
NodeType allowed = allowedLocalityLevel.get(schedulerKey);
if (allowed != null) {
if (allowed.equals(NodeType.OFF_SWITCH) && (type.equals(
NodeType.NODE_LOCAL) || type.equals(NodeType.RACK_LOCAL))) {
this.resetAllowedLocalityLevel(schedulerKey, type);
} else if (allowed.equals(NodeType.RACK_LOCAL) && type.equals(
NodeType.NODE_LOCAL)) {
this.resetAllowedLocalityLevel(schedulerKey, type);
}
}
if (getOutstandingAsksCount(schedulerKey) <= 0) {
return null;
}
container = reservedContainer;
if (container == null) {
container = createContainer(node, pendingAsk.getPerAllocationResource(),
schedulerKey);
}
// Create an RMContainer object, the RM-side representation of the container
rmContainer = new RMContainerImpl(container, schedulerKey,
getApplicationAttemptId(), node.getNodeID(),
appSchedulingInfo.getUser(), rmContext);
((RMContainerImpl) rmContainer).setQueueName(this.getQueueName());
// Add the container to the list of newly allocated containers
addToNewlyAllocatedContainers(node, rmContainer);
liveContainers.put(container.getId(), rmContainer);
// Record the allocation in appSchedulingInfo
ContainerRequest containerRequest = appSchedulingInfo.allocate(
type, node, schedulerKey, container);
this.attemptResourceUsage.incUsed(container.getResource());
getQueue().incUsedResource(container.getResource());
// Store the originating resource request in the RMContainer
((RMContainerImpl) rmContainer).setContainerRequest(containerRequest);
// With the bookkeeping done, trigger the RMContainer state machine
rmContainer.handle(
new RMContainerEvent(container.getId(), RMContainerEventType.START));
if (LOG.isDebugEnabled()) {
LOG.debug("allocate: applicationAttemptId=" + container.getId()
.getApplicationAttemptId() + " container=" + container.getId()
+ " host=" + container.getNodeId().getHost() + " type=" + type);
}
RMAuditLogger.logSuccess(getUser(), AuditConstants.ALLOC_CONTAINER,
"SchedulerApp", getApplicationId(), container.getId(),
container.getResource());
} finally {
writeLock.unlock();
}
return rmContainer;
}
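As we can see, this function mainly creates the RMContainer, the RM-side representation of a Container, updates some related bookkeeping, and then triggers the RMContainer state machine by sending it an RMContainerEventType.START event. The corresponding state machine transition is: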
addTransition(RMContainerState.NEW, RMContainerState.ALLOCATED,
RMContainerEventType.START, new ContainerStartedTransition())
private static final class ContainerStartedTransition extends
BaseTransition {
public void transition(RMContainerImpl container, RMContainerEvent event) {
container.rmContext.getAllocationTagsManager().addContainer(
container.getNodeId(), container.getContainerId(),
container.getAllocationTags());
container.eventHandler.handle(new RMAppAttemptEvent(
container.appAttemptId, RMAppAttemptEventType.CONTAINER_ALLOCATED));
}
}
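The hook attached to this transition finally sends an RMAppAttemptEventType.CONTAINER_ALLOCATED event. An RMAppAttempt that has not yet been given a container has been sitting in RMAppAttemptState.SCHEDULED, and now it may finally be able to move on.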
addTransition(RMAppAttemptState.SCHEDULED,
EnumSet.of(RMAppAttemptState.ALLOCATED_SAVING,
RMAppAttemptState.SCHEDULED),
RMAppAttemptEventType.CONTAINER_ALLOCATED,
new AMContainerAllocatedTransition())
As we can see, this is a multi-arc transition; the corresponding transition function is:
private static final class AMContainerAllocatedTransition
implements
MultipleArcTransition<RMAppAttemptImpl, RMAppAttemptEvent, RMAppAttemptState> {
public RMAppAttemptState transition(RMAppAttemptImpl appAttempt,
RMAppAttemptEvent event) {
// Try to fetch the container that the AM needs
Allocation amContainerAllocation =
appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
EMPTY_CONTAINER_REQUEST_LIST, null, EMPTY_CONTAINER_RELEASE_LIST, null,
null, new ContainerUpdates());
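// CONTAINER_ALLOCATED is fired only after an RMContainer has been created, so at least one
// container has been allocated and stored in SchedulerApplication#newlyAllocatedContainers.
// However, the scheduler's allocate() cannot guarantee that the container can actually be
// pulled, because the pull can fail for some reason, for example DNS being unavailable.
// In that case the attempt returns to the previous state, SCHEDULED, and tries again to
// fetch the container the AM needs.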
if (amContainerAllocation.getContainers().size() == 0) {
appAttempt.retryFetchingAMContainer(appAttempt);
return RMAppAttemptState.SCHEDULED;
}
// A container was fetched, so assign it to the AM
appAttempt.setMasterContainer(amContainerAllocation.getContainers()
.get(0));
RMContainerImpl rmMasterContainer = (RMContainerImpl)appAttempt.scheduler
.getRMContainer(appAttempt.getMasterContainer().getId());
rmMasterContainer.setAMContainer(true);
appAttempt.rmContext.getNMTokenSecretManager()
.clearNodeSetForAttempt(appAttempt.applicationAttemptId);
appAttempt.getSubmissionContext().setResource(
appAttempt.getMasterContainer().getResource());
appAttempt.storeAttempt();
return RMAppAttemptState.ALLOCATED_SAVING;
}
}
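What makes this transition "multi-arc" is that the hook itself picks the next state from a set of permitted targets. A minimal sketch of that idea follows; `MultiArc` is a hypothetical class written for this illustration and is not YARN's StateMachineFactory, and the number of fetched containers stands in for the result of the scheduler's allocate() call.

```java
import java.util.EnumSet;
import java.util.function.IntFunction;

final class MultiArcSketch {
    enum State { SCHEDULED, ALLOCATED_SAVING }

    // Hypothetical multi-arc transition: the hook returns the next state,
    // which must be one of the permitted target states.
    static final class MultiArc {
        private final EnumSet<State> permitted;
        private final IntFunction<State> hook;

        MultiArc(EnumSet<State> permitted, IntFunction<State> hook) {
            this.permitted = permitted;
            this.hook = hook;
        }

        State fire(int containersFetched) {
            State next = hook.apply(containersFetched);
            if (!permitted.contains(next)) {
                throw new IllegalStateException("Invalid target state: " + next);
            }
            return next;
        }
    }

    /** Mirrors the decision above: no container means stay in SCHEDULED. */
    static MultiArc amContainerAllocated() {
        return new MultiArc(
            EnumSet.of(State.ALLOCATED_SAVING, State.SCHEDULED),
            fetched -> fetched == 0 ? State.SCHEDULED : State.ALLOCATED_SAVING);
    }

    public static void main(String[] args) {
        System.out.println(amContainerAllocated().fire(0)); // prints "SCHEDULED"
        System.out.println(amContainerAllocated().fire(1)); // prints "ALLOCATED_SAVING"
    }
}
```

A single-arc transition would fix the target state when it is registered; here the same event can leave the attempt where it is or move it forward, depending on what the hook finds at run time.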
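In "Hadoop Yarn 3.1.0 Source Code Analysis (02 Job Scheduling)" we saw that when an RMAppAttempt in the SUBMITTED state receives an ATTEMPT_ADDED event, it runs ScheduleTransition.transition(), which calls appAttempt.scheduler.allocate() to try to collect already allocated containers from the newlyAllocatedContainers collection. If there are none, the attempt stays in the SCHEDULED state. If a container is collected, its RMContainer receives an ACQUIRED event. By this point, after the node heartbeat, the RMContainer created in allocate() has received the RMContainerEventType.START event and is in RMContainerState.ALLOCATED, so its state machine can move forward: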
addTransition(RMContainerState.ALLOCATED, RMContainerState.ACQUIRED,
RMContainerEventType.ACQUIRED, new AcquiredTransition())
Back to AMContainerAllocatedTransition above:
if (amContainerAllocation.getContainers().size() == 0) {
appAttempt.retryFetchingAMContainer(appAttempt);
// No container was collected, so stay in the SCHEDULED state
return RMAppAttemptState.SCHEDULED;
}
appAttempt.storeAttempt();
return RMAppAttemptState.ALLOCATED_SAVING;
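If no container can be collected yet, a new thread is started that waits and then fires the event again, so the fetch is retried periodically until it succeeds.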
appAttempt.retryFetchingAMContainer():
private void retryFetchingAMContainer(final RMAppAttemptImpl appAttempt) {
new Thread() {
public void run() {
try {
// Wait 500 ms before trying again
Thread.sleep(500);
} catch (InterruptedException e) {
LOG.warn("Interrupted while waiting to resend the"
+ " ContainerAllocated Event.");
}
// Fire the event again to retry
appAttempt.eventHandler.handle(
new RMAppAttemptEvent(appAttempt.applicationAttemptId,
RMAppAttemptEventType.CONTAINER_ALLOCATED));
}
}.start();
}
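The same wait-then-resend loop can be sketched outside YARN. Note that this sketch swaps in a ScheduledExecutorService where the excerpt above spawns a new Thread per retry; that substitution is a choice made for this illustration only. The fetch is simulated: `succeedOnAttempt` is a hypothetical parameter saying on which attempt a container becomes available.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class RetryFetchSketch {
    /**
     * Handles a simulated CONTAINER_ALLOCATED event, and re-delivers it after
     * delayMs whenever the fetch comes back empty. Returns the number of attempts.
     */
    static int attemptsUntilFetched(int succeedOnAttempt, long delayMs) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(1);
        Runnable handleEvent = new Runnable() {
            @Override
            public void run() {
                int attempt = attempts.incrementAndGet();
                boolean fetched = attempt >= succeedOnAttempt; // simulated fetch result
                if (fetched) {
                    done.countDown(); // the attempt would now leave SCHEDULED
                } else {
                    // Stay in SCHEDULED and send the same event again later.
                    timer.schedule(this, delayMs, TimeUnit.MILLISECONDS);
                }
            }
        };
        try {
            timer.execute(handleEvent);
            if (!done.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("Timed out waiting for a container");
            }
            return attempts.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(attemptsUntilFetched(3, 5)); // prints "3"
    }
}
```

Each failed fetch schedules exactly one more delivery of the event, so the retry is driven by the event handler itself and not by a loop inside the sleeping thread.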
Once the RMAppAttempt has collected a container, it is in the RMAppAttemptState.ALLOCATED_SAVING state. appAttempt.storeAttempt() is asynchronous, and when it finishes it sends an event to the RMAppAttempt:
appAttempt.storeAttempt() -> RMStateStore.storeNewApplicationAttempt() :
public void storeNewApplicationAttempt(RMAppAttempt appAttempt) {
//....
getRMStateStoreEventHandler().handle(
new RMStateStoreAppAttemptEvent(attemptState));
}
public RMStateStoreAppAttemptEvent(ApplicationAttemptStateData attemptState) {
super(RMStateStoreEventType.STORE_APP_ATTEMPT);
this.attemptState = attemptState;
}
The corresponding state machine transition is:
addTransition(RMStateStoreState.ACTIVE,
EnumSet.of(RMStateStoreState.ACTIVE, RMStateStoreState.FENCED),
RMStateStoreEventType.STORE_APP_ATTEMPT,
new StoreAppAttemptTransition())
private static class StoreAppAttemptTransition implements
MultipleArcTransition<RMStateStore, RMStateStoreEvent,
RMStateStoreState> {
public RMStateStoreState transition(RMStateStore store,
RMStateStoreEvent event) {
//.....
store.notifyApplicationAttempt(new RMAppAttemptEvent
(attemptState.getAttemptId(),
RMAppAttemptEventType.ATTEMPT_NEW_SAVED));
return finalState(isFenced);
};
}
As we can see, once the app attempt information has been saved, an RMAppAttemptEventType.ATTEMPT_NEW_SAVED event is sent to the RMAppAttempt:
addTransition(RMAppAttemptState.ALLOCATED_SAVING,
RMAppAttemptState.ALLOCATED,
RMAppAttemptEventType.ATTEMPT_NEW_SAVED, new AttemptStoredTransition())
The RMAppAttempt enters the RMAppAttemptState.ALLOCATED state; the hook attached to this transition is:
private static final class AttemptStoredTransition extends BaseTransition {
@Override
public void transition(RMAppAttemptImpl appAttempt,
RMAppAttemptEvent event) {
// Security and authentication related
appAttempt.registerClientToken();
appAttempt.launchAttempt();
}
}
private void launchAttempt(){
launchAMStartTime = System.currentTimeMillis();
// Send event to launch the AM Container
eventHandler.handle(new AMLauncherEvent(AMLauncherEventType.LAUNCH, this));
}
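The chain of events traced in this section, from CONTAINER_ALLOCATED to the LAUNCH event, can be replayed with a small single-threaded event queue. This is a rough sketch under stated assumptions: the enums reuse names from the excerpts above, but the dispatcher is hypothetical, the state store is reduced to "save, then notify", and the AM container is assumed to be fetched on the first try.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class AttemptLaunchChainSketch {
    enum AttemptState { SCHEDULED, ALLOCATED_SAVING, ALLOCATED }
    enum Event { CONTAINER_ALLOCATED, STORE_APP_ATTEMPT, ATTEMPT_NEW_SAVED, LAUNCH }

    /** Replays the event chain and records "event->attempt state" after each step. */
    static List<String> run() {
        Deque<Event> queue = new ArrayDeque<>();
        List<String> trace = new ArrayList<>();
        AttemptState state = AttemptState.SCHEDULED;
        queue.add(Event.CONTAINER_ALLOCATED);
        while (!queue.isEmpty()) {
            Event event = queue.poll();
            switch (event) {
                case CONTAINER_ALLOCATED:
                    // Assumed: the AM container is fetched successfully.
                    if (state == AttemptState.SCHEDULED) {
                        state = AttemptState.ALLOCATED_SAVING;
                        queue.add(Event.STORE_APP_ATTEMPT); // storeAttempt()
                    }
                    break;
                case STORE_APP_ATTEMPT:
                    // The store saves the attempt, then notifies it.
                    queue.add(Event.ATTEMPT_NEW_SAVED);
                    break;
                case ATTEMPT_NEW_SAVED:
                    if (state == AttemptState.ALLOCATED_SAVING) {
                        state = AttemptState.ALLOCATED;
                        queue.add(Event.LAUNCH); // launchAttempt()
                    }
                    break;
                case LAUNCH:
                    // Handled by the AM launcher; not modeled here.
                    break;
            }
            trace.add(event + "->" + state);
        }
        return trace;
    }

    public static void main(String[] args) {
        for (String step : run()) {
            System.out.println(step);
        }
    }
}
```

Because saving and notifying are separate events, the attempt stays in ALLOCATED_SAVING until the store reports back, and only then is the LAUNCH event for the AM container emitted.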