Debugging How LLVM Generates SSA

Notes from a debugging session; rough content, not worth reading.


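In my article 《構造SSA》 (Constructing SSA) I described how SSA is constructed, from placing ϕ-functions and renaming through to the later SSA destruction. This article steps through LLVM in a debugger to show how it builds the final SSA form.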

int fac(int num) {
	if (num == 1)
		return 1;
	return num * fac(num - 1);
}
int main() {
	fac(10);
}

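Before looking at how LLVM generates SSA, let's first see how to produce IR that contains ϕ-instructions. If you are not familiar with the IR, 《2019 EuroLLVM Developers’ Meeting: V. Bridgers & F. Piovezan “LLVM IR Tutorial - Phis, GEPs …”》 is the best introductory video on LLVM IR that I know of.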

Clang itself does not produce optimized LLVM IR. It produces fairly straightforward IR wherein locals are kept in memory (using allocas). The optimizations are done by opt on LLVM IR level, and one of the most important optimizations is indeed mem2reg which makes sure that locals are represented in LLVM’s SSA values instead of memory. - 《How to get “phi” instruction in llvm without optimization》

// test.c
int foo(int a, int b) {
	int r;
	if (a > b)
		r = a;
	else 
		r = b;
	return r;
}

For the code above, the IR that clang emits directly is shown below. As we can see, it is still very raw.

// clang -S -emit-llvm test.c -o test_original.ll
; ModuleID = 'test.c'
source_filename = "test.c"
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"
; Function Attrs: noinline nounwind optnone ssp uwtable
define i32 @foo(i32 %a, i32 %b) #0 {
entry:
  %a.addr = alloca i32, align 4
  %b.addr = alloca i32, align 4
  %r = alloca i32, align 4
  store i32 %a, i32* %a.addr, align 4
  store i32 %b, i32* %b.addr, align 4
  %0 = load i32, i32* %a.addr, align 4
  %1 = load i32, i32* %b.addr, align 4
  %cmp = icmp sgt i32 %0, %1
  br i1 %cmp, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  %2 = load i32, i32* %a.addr, align 4
  store i32 %2, i32* %r, align 4
  br label %if.end

if.else:                                          ; preds = %entry
  %3 = load i32, i32* %b.addr, align 4
  store i32 %3, i32* %r, align 4
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  %4 = load i32, i32* %r, align 4
  ret i32 %4
}

attributes #0 = { noinline nounwind optnone ssp uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="penryn" "target-features"="+cx16,+cx8,+fxsr,+mmx,+sahf,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0, !1, !2}
!llvm.ident = !{!3}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 10, i32 15]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 7, !"PIC Level", i32 2}
!3 = !{!"clang version 10.0.0 (https://github.com/llvm/llvm-project.git 36663d506e31a43934f10dff5a3020d3aad41ef1)"}

LLVM uses the -mem2reg pass to remove the alloca, store, and load instructions from the IR above and turn the code into SSA IR.

This file promotes memory references to be register references. It promotes alloca instructions which only have loads and stores as uses. An alloca is transformed by using dominator frontiers to place phi nodes, then traversing the function in depth-first order to rewrite loads and stores as appropriate. This is just the standard SSA construction algorithm to construct “pruned” SSA form. - mem2reg: Promote Memory to Register

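Commands for generating SSA IR

The commands for generating IR with ϕ-instructions are as follows:

$clang -S -emit-llvm -Xclang -disable-O0-optnone test.c // emit human-readable IR (test.ll)
$opt -mem2reg test.ll -o test.bc // convert the IR to SSA form
$llvm-dis test.bc // use llvm-dis to produce a human-readable form

The -disable-O0-optnone flag in the commands above removes the optnone attribute, so that opt is able to run passes on the function. The first command produces the following: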

; ModuleID = 'test.c'
source_filename = "test.c"
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"
; Function Attrs: noinline nounwind ssp uwtable
define i32 @foo(i32 %a, i32 %b) #0 {
entry:
  %a.addr = alloca i32, align 4
  %b.addr = alloca i32, align 4
  %r = alloca i32, align 4
  store i32 %a, i32* %a.addr, align 4
  store i32 %b, i32* %b.addr, align 4
  %0 = load i32, i32* %a.addr, align 4
  %1 = load i32, i32* %b.addr, align 4
  %cmp = icmp sgt i32 %0, %1
  br i1 %cmp, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  %2 = load i32, i32* %a.addr, align 4
  store i32 %2, i32* %r, align 4
  br label %if.end

if.else:                                          ; preds = %entry
  %3 = load i32, i32* %b.addr, align 4
  store i32 %3, i32* %r, align 4
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  %4 = load i32, i32* %r, align 4
  ret i32 %4
}

attributes #0 = { noinline nounwind ssp uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="penryn" "target-features"="+cx16,+cx8,+fxsr,+mmx,+sahf,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0, !1, !2}
!llvm.ident = !{!3}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 10, i32 15]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 7, !"PIC Level", i32 2}
!3 = !{!"clang version 10.0.0 (https://github.com/llvm/llvm-project.git 36663d506e31a43934f10dff5a3020d3aad41ef1)"}

The second command produces the following (shown in the human-readable form from llvm-dis):

; ModuleID = 'test.bc'
source_filename = "test.c"
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"
; Function Attrs: noinline nounwind ssp uwtable
define i32 @foo(i32 %a, i32 %b) #0 {
entry:
  %cmp = icmp sgt i32 %a, %b
  br i1 %cmp, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  br label %if.end

if.else:                                          ; preds = %entry
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  %r.0 = phi i32 [ %a, %if.then ], [ %b, %if.else ]
  ret i32 %r.0
}

attributes #0 = { noinline nounwind ssp uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="penryn" "target-features"="+cx16,+cx8,+fxsr,+mmx,+sahf,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0, !1, !2}
!llvm.ident = !{!3}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 10, i32 15]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 7, !"PIC Level", i32 2}
!3 = !{!"clang version 10.0.0 (https://github.com/llvm/llvm-project.git 36663d506e31a43934f10dff5a3020d3aad41ef1)"}

IRGen: Add optnone attribute on function during O0
[llvm-dev] Clang/LLVM 5.0 optnone attribute with -O0
LLVM opt mem2reg has no effect
Assignment 1: Introduction to LLVM
-O0 is not a recommended option for clang
opt is defunct when code built without optimizations

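DominatorTreeWrapperPass

The dominator information is computed by DominatorTreeWrapperPass, which is also the first pass that the command opt -mem2reg test.ll -o test.bc runs on this module.

Compute the dominator tree & dominance frontier

In 2017 LLVM replaced the traditional LT (Lengauer-Tarjan) algorithm with the SEMI-NCA algorithm for computing dominator information; see 《2017 LLVM Developers’ Meeting: J. Kuderski “Dominator Trees and incremental updates that transcend time”》.

First, use the command opt -dot-cfg ... to generate the CFG of the example code, shown below:
[Figure: CFG of the example code]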

The SEMI-NCA algorithm

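The SEMI-NCA algorithm was brought into LLVM by the patch “[Dominators] Use Semi-NCA instead of SLT to calculate dominators”.
Note: for the details of the SEMI-NCA algorithm, see 再談Dominator Tree的計算 (Revisiting the Computation of the Dominator Tree).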

Debug Process

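[Figure: debug process]
The figure above shows the call chain leading up to the entry of DominatorTreeWrapperPass. We can see that the dominator pass is only a small part of the many passes. The inheritance relationships among the classes involved are as follows:

The corresponding code and its comments are as follows: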

// LegacyPassManager.cpp

// PassManager manages ModulePassManagers
class PassManager : public PassManagerBase{
	// ...
};

// run - Execute all of the passes scheduled for execution. Keep track of 
// whether any of the passes modifies the module, and if so, return true.
bool PassManager::run(Module &M) {
	return PM->run(M);
}
//===----------------------------------------------------------------------===//
// MPPassManager
//
// MPPassManager manages ModulePasses and function pass managers.
// It batches all Module passes and function pass managers together and
// sequences them to process one module.
class MPPassManager : public Pass, public PMDataManager {
	// ...
};

// Execute all of the passes scheduled for execution by invoking
// runOnModule method. Keep track of whether any of the passes modifies
// the module, and if so, return true.
bool MPPassManager::runOnModule(Module &M) {
	// ...
	for (unsigned Index = 0; Index < getNumContainedPasses(); ++Index) {
		// ...
		LocalChanged |= MP->runOnModule(M);
		// ...
	}
}
// LegacyPassManager.cpp::FPPassManager::runOnFunction

// FPPassManager manages BBPassManagers and FunctionPasses.
// It batches all function passes and basic block pass managers together and 
// sequence them to process one function at a time before processing next
// function.
class FPPassManager : public ModulePass, public PMDataManager {
// ...
};

/// Execute all of the passes scheduled for execution by invoking
/// runOnFunction method. Keep track of whether any of the passes modifies
/// the function, and if so, return true.
bool FPPassManager::runOnFunction(Function &F) {
  // ...
  for (unsigned Index = 0; Index < getNumContainedPasses(); ++Index) {
  	FunctionPass *FP = getContainedPass(Index);
  	bool LocalChanged = false;

	{
		// ...
		LocalChanged |= FP->runOnFunction(F);
		// ...
	}
  }
  // ...
}
bool DominatorTreeWrapperPass::runOnFunction(Function &F) {
  DT.recalculate(F);
  return false;
}

DominatorTreeBase::recalculate

Now we enter the actual dominator tree computation; SemiNCAInfo<DomTreeT>::CalculateFromScratch does the concrete work.

/// recalculate - compute a dominator tree for the given function
void recalculate(ParentType &Func) {
  Parent = &Func;
  DomTreeBuilder::Calculate(*this);
}
// ...
template <class DomTreeT>
void Calculate(DomTreeT &DT) {
  SemiNCAInfo<DomTreeT>::CalculateFromScratch(DT, nullptr);
}

SemiNCAInfo<DomTreeT>::CalculateFromScratch is a typical implementation of the SEMI-NCA algorithm: the first step is doFullDFSWalk, and the second step runs runSemiNCA.

static void CalculateFromScratch(DomTree &DT, BatchUpdatePtr BUI) {
	auto *Parent = DT.Parent;
	DT.reset();
	DT.Parent = Parent;
	SemiNCAInfo SNCA(nullptr); // Since we are rebuilding the whole tree,
							   // there is no point doing it incrementally.

	// Step #0: Number blocks in depth-first order and initialize variables used 
	// in later stages of the algorithm.
	DT.Roots = FindRoots(DT, nullptr);
	SNCA.doFullDFSWalk(DT, AlwaysDescend);

	SNCA.runSemiNCA(DT);
	if (BUI) {
		BUI->IsRecalculated = true;
		LLVM_DEBUG(
			dbgs() << "DomTree recalculated, skipping future batch updates\n");
	}

	if (DT.Roots.empty()) return;

	// Add a node for the root. If the tree is a PostDominatorTree it will be
	// the virtual exit (denoted by (BasicBlock *) nullptr) which postdominates
	// all real exits (including multiple exit blocks, infinite loops).
	NodePtr Root = IsPostDom ? nullptr : DT.Roots[0];

	DT.RootNode = (DT.DomTreeNodes[Root] = 
					std::make_unique<DomTreeNodeBase<NodeT>>(Root, nullptr)).get();
	SNCA.attachNewSubTree(DT, DT.RootNode);
}

runDFS

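runDFS is a typical stack-based depth-first traversal. It assigns DFS numbers to the BasicBlocks and records the reverse-children (predecessor) relation; I won't go into detail here.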

// Custom DFS implementation which can skip nodes based on a provided
// predicate. It also collects ReverseChildren so that we don't have to spend 
// time getting predecessors in SemiNCA.
//
// If IsReverse is set to true, the DFS walk will be performed backwards
// relative to IsPostDom -- using reverse edges for dominators and forward
// edges for postdominators.
template <bool IsReverse = false, typename DescendCondition>
unsigned runDFS(NodePtr V, unsigned LastNum, DescendCondition Condition, unsigned AttachToNum) {
	// ...
}

After runDFS, the original CFG looks like this:
[Figure: the CFG after DFS numbering]

runSemiNCA

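runSemiNCA can be split into two typical steps: the first computes the sdom values in reverse preorder, and the second computes the idom values in preorder via NCA (nearest common ancestor).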

// This function requires DFS to be run before calling it.
void runSemiNCA(DomTreeT &DT, const unsigned MinLevel = 0) {
	const unsigned NextDFSNum(NumToNode.size());
	// Initialize IDoms to spanning tree parents.
	for (unsigned i = 1; i < NextDFSNum; ++i) {
		const NodePtr V = NumToNode[i];
		auto &VInfo = NodeToInfo[V];
		VInfo.IDom = NumToNode[VInfo.Parent];
	}

	// Step #1: Calculate the semidominators of all vertices.
	SmallVector<InfoRec *, 32> EvalStack;
	for (unsigned i = NextDFSNum - 1; i >= 2; --i) {
		NodePtr W = NumToNode[i];
		auto &WInfo = NodeToInfo[W];

		// Initialize the semi dominator to point to the parent node.
		WInfo.Semi = WInfo.Parent;
		for (const auto &N : WInfo.ReverseChildren) {
			if (NodeToInfo.count(N) == 0) // Skip unreachable predecessors.
				continue;
			
			const TreeNodePtr TN = DT.getNode(N);
			// Skip predecessors whose level is above the subtree we are processing.
			if (TN && TN->getLevel() < MinLevel)
				continue;
			
			unsigned SemiU = NodeToInfo[eval(N, i + 1, EvalStack)].Semi;
			if (SemiU < WInfo.Semi) WInfo.Semi = SemiU;
		}
	}

	// Step #2: Explicitly define the immediate dominator of each vertex.
	// 			IDom[i] = NCA(SDom[i], SpanningTreeParent(i)).
	// Note that the parents were stored in IDoms and later got invalidated
	// during path compression in Eval.
	for (unsigned i = 2; i < NextDFSNum; ++i) {
		const NodePtr W = NumToNode[i];
		auto &WInfo = NodeToInfo[W];
		const unsigned SDomNum = NodeToInfo[NumToNode[WInfo.Semi]].DFSNum;
		NodePtr WIDomCandidate = WInfo.IDom;
		while (NodeToInfo[WIDomCandidate].DFSNum > SDomNum)
			WIDomCandidate = NodeToInfo[WIDomCandidate].IDom;
		
		WInfo.IDom = WIDomCandidate;
	}
}

After Step #1 has run, the CFG looks like this:
[Figure: eval]
After Step #2 has run, the CFG looks like this:
[Figure: IDom]
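
To make the two steps concrete, here is a minimal, self-contained sketch of Semi-NCA on a CFG stored as plain successor lists. It is not LLVM's code: computeIDoms, the integer block ids, and the recursive dfs and compress helpers are all assumptions made for illustration, and the classic ancestor/label link-eval scheme stands in for LLVM's stack-based eval.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy Semi-NCA. succ[v] lists the CFG successors of block v.
// Returns idom[v] for every block (idom[entry] == entry, -1 if unreachable).
std::vector<int> computeIDoms(const std::vector<std::vector<int>> &succ,
                              int entry = 0) {
  const int n = static_cast<int>(succ.size());
  assert(entry >= 0 && entry < n);
  std::vector<int> dfn(n, -1);             // block -> DFS preorder number
  std::vector<int> order;                  // DFS preorder number -> block
  std::vector<int> parent(n, -1);          // parent in the DFS spanning tree
  std::vector<std::vector<int>> preds(n);  // reverse edges seen by the DFS

  // Step #0: number the blocks in DFS preorder and record predecessors.
  std::function<void(int)> dfs = [&](int v) {
    dfn[v] = static_cast<int>(order.size());
    order.push_back(v);
    for (int s : succ[v]) {
      preds[s].push_back(v);
      if (dfn[s] < 0) {
        parent[s] = v;
        dfs(s);
      }
    }
  };
  dfs(entry);

  std::vector<int> semi(dfn);  // semidominator, stored as a DFS number
  std::vector<int> label(n), ancestor(n, -1), idom(n, -1);
  for (int v = 0; v < n; ++v) label[v] = v;

  // eval(v): block with the smallest semi on the linked path above v.
  std::function<void(int)> compress = [&](int v) {
    const int a = ancestor[v];
    if (ancestor[a] < 0) return;
    compress(a);
    if (semi[label[a]] < semi[label[v]]) label[v] = label[a];
    ancestor[v] = ancestor[a];
  };
  auto eval = [&](int v) {
    if (ancestor[v] < 0) return v;
    compress(v);
    return label[v];
  };

  // Step #1: semidominators, in reverse preorder.
  for (int i = static_cast<int>(order.size()) - 1; i >= 1; --i) {
    const int w = order[i];
    semi[w] = dfn[parent[w]];
    for (int p : preds[w]) {
      const int u = eval(p);
      if (semi[u] < semi[w]) semi[w] = semi[u];
    }
    ancestor[w] = parent[w];  // link w into the forest
  }

  // Step #2: idom[w] = NCA(sdom[w], parent[w]), in preorder.
  idom[entry] = entry;
  for (int i = 1; i < static_cast<int>(order.size()); ++i) {
    const int w = order[i];
    int candidate = parent[w];
    while (dfn[candidate] > semi[w]) candidate = idom[candidate];
    idom[w] = candidate;
  }
  return idom;
}
```

For the example function, numbering the blocks entry=0, if.then=1, if.else=2, if.end=3 and passing the successor lists {{1, 2}, {3}, {3}, {}} gives entry as the immediate dominator of every block, which is what the IDom figure shows.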

mem2reg

The mem2reg pass lives in llvm/lib/Transforms/Utils/Mem2Reg.cpp. I set a breakpoint inside the body of Mem2Reg.cpp::PromoteLegacyPass::runOnFunction; the call stack is as follows.

// commit 36663d506e31a43934f10dff5a3020d3aad41ef1
// vscode lldb

// Call Stack
(anonymous namespace)::PromoteLegacyPass::runOnFunction(llvm::Function&)   Mem2Reg.cpp
llvm::FPPassManager::runOnFunction(llvm::Function&)                        LegacyPassManager.cpp
llvm::FPPassManager::runOnModule(llvm::Module&)                            LegacyPassManager.cpp
(anonymous namespace)::MPPassManager::runOnModule(llvm::Module&)           LegacyPassManager.cpp
llvm::legacy::PassManagerImpl::run(llvm::Module&)                          LegacyPassManager.cpp
llvm::legacy::PassManager::run(llvm::Module&)                              LegacyPassManager.cpp
main opt.cpp

The body of runOnFunction is as follows:

// runOnFunction - To run this pass, first we calculate the alloca
// instructions that are safe for promotion, then we promote each one.
bool runOnFunction(Function &F) override {
	if (skipFunction(F))
		return false;
	
	DominatorTree &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
	AssumptionCache &AC = 
		getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
	return promoteMemoryToRegister(F, DT, AC);
}

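The execution of the whole program forms a call tree, but when the debugger hits a breakpoint it only shows the current path. Something like DominatorTreeWrapperPass has already finished running earlier.
[Figure: Call Stack]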

Place ϕ-functions

In LLVM the transformation from stack variables to register values is performed in optimization passes. Running a mem2reg optimization pass on the IR will transform memory objects to register values whenever possible (or the heuristics say so). The optimization pass is implemented in PromoteMemoryToRegister.cpp which analyzes the BasicBlocks and the alloca instructions for PHINode placement. The PHINode placement is calculated with algorithm by Sreedhar and Gao that has been modified to not use the DJ (Dominator edge, Join edge) graphs. According to Sreedhar and Gao the algorithm is approximately five times faster on average than the Cytron et al. algorithm. The speed gain results from calculating dominance frontiers for only nodes that potentially need phi nodes and well designed data structures. LLVM SSA

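As we know, generating SSA takes three steps:

  • compute the dominance information
  • insert ϕ-instructions
  • rename

Once the dominance information has been computed, the next step is inserting ϕ-instructions. This is done by PromoteMem2Reg::run(), which has two major parts: placing ϕ-instructions and renaming.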

// PromoteMemoryToRegister.cpp
// This file promotes memory references to be register references. It promotes
// alloca instructions which only have loads and stores as uses. An alloca is
// transformed by using iterated dominator frontiers to place PHI nodes, then
// traversing the function in depth-first order to rewrite loads and stores as
// appropriate.

struct PromoteMem2Reg {
	// The alloca instructions being promoted.
	std::vector<AllocaInst *> Allocas;

	DominatorTree &DT;

	const SimplifyQuery SQ;

	// Reverse mapping of Allocas.
	DenseMap<AllocaInst *, unsigned> AllocaLookup;

	// The PhiNodes we're adding.
	//
	// That map is used to simplify some Phi nodes as we iterate over it, so
	// it should have deterministic iterators. We could use MapVector, but
	// since we already maintain a map from BasicBlock* to a stable numbering
	// (BBNumbers), the DenseMap is more efficient (also supports removal).
	DenseMap<std::pair<unsigned, unsigned>, PHINode *> NewPhiNodes;

	// For each PHI node, keep track of which entry in Allocas it corresponds
	// to.
	DenseMap<PHINode *, unsigned> PhiToAllocaMap;
	
	// The set of basic blocks the renamer has already visited.
	SmallPtrSet<BasicBlock *, 16> Visited;

	// Contains a stable numbering of basic blocks to avoid non-deterministic
	// behavior.
	DenseMap<BasicBlock *, unsigned> BBNumbers;

	// Lazily compute the number of predecessors a block has.
	DenseMap<const BasicBlock *, unsigned> BBNumPreds;

	void run();
private:
	void ComputeLiveInBlocks(AllocaInst *AI, AllocaInfo &Info, 
							const SmallPtrSetImpl<BasicBlock *> &DefBlocks,
							SmallPtrSetImpl<BasicBlock *> &LiveInBlocks);

	void RenamePass(BasicBlock *BB, BasicBlock *Pred,
					RenamePassData::ValVector &IncVals,
					RenamePassData::LocationVector &InstLocs,
					std::vector<RenamePassData> &WorkList);

	bool QueuePhiNode(BasicBlock *BB, unsigned AllocaIdx, unsigned &Version);
};

void PromoteMem2Reg::run() {
	Function &F = *DT.getRoot()->getParent();

	AllocaDbgDeclares.resize(Allocas.size());

	AllocaInfo Info;
	LargeBlockInfo LBI;
	ForwardIDFCalculator IDF(DT);

	// Part 1: place phi nodes
	for(unsigned AllocaNum = 0; AllocaNum != Allocas.size(); ++AllocaNum) {
		AllocaInst *AI = Allocas[AllocaNum];
		if (AI->use_empty()) {
			// If there are no uses of the alloca, just delete it now.
			AI->eraseFromParent();

			// Remove the alloca from the Allocas list, since it has been processed
			RemoveFromAllocasList(AllocaNum);
			++NumDeadAlloca;
			continue;
		}

		// Calculate the set of read and write-locations for each alloca. This is
		// analogous to finding the 'uses' and 'definitions' of each variable.
		Info.AnalyzeAlloca(AI);

		// If there is only a single store to this value, replace any loads of
		// it that are directly dominated by the definition with the value stored.
		if (Info.DefiningBlocks.size() == 1) {
			if (rewriteSingleStoreAlloca(AI, Info, LBI, SQ.DL, DT, AC)) {
				// The alloca has been processed, move on.
				RemoveFromAllocasList(AllocaNum);
				++NumSingleStore;
				continue;
			}
		}

		// If the alloca is only read and written in one basic block, just perform a 
		// linear sweep over the block to eliminate it.
		if (Info.OnlyUsedInOneBlock && 
			promoteSingleBlockAlloca(AI, Info, LBI, SQ.DL, DT, AC)) {
			// The alloca has been processed, move on.
			RemoveFromAllocasList(AllocaNum);
			continue;
		}

		// ...

		// Unique the set of defining blocks for efficient lookup.
		SmallPtrSet<BasicBlock *, 32> DefBlocks(Info.DefiningBlocks.begin(),
												Info.DefiningBlocks.end());
		
		// Determine which blocks the value is live in. These are blocks which lead
		// to uses.
		SmallPtrSet<BasicBlock *, 32> LiveInBlocks;
		ComputeLiveInBlocks(AI, Info, DefBlocks, LiveInBlocks);

		// At this point, we're committed to promoting the alloca using IDF's, and
		// the standard SSA construction algorithm. Determine which blocks need PHI
		// nodes and see if we can optimize out some work by avoiding insertion of
		// dead phi nodes.
		IDF.setLiveInBlocks(LiveInBlocks);
		IDF.setDefiningBlocks(DefBlocks);
		SmallVector<BasicBlock *, 32> PHIBlocks;
		IDF.calculate(PHIBlocks);
		llvm::sort(PHIBlocks, [this](BasicBlock *A, BasicBlock *B) {
			return BBNumbers.find(A)->second < BBNumbers.find(B)->second;
		});

		unsigned CurrentVersion = 0;
		for (BasicBlock *BB : PHIBlocks)
			QueuePhiNode(BB, AllocaNum, CurrentVersion);
	}

	// Part 2: rename pass
	// ...
}

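The first part of run() is a for loop that processes each alloca instruction and computes its corresponding ϕ-instructions. Let's look back at the original IR: it has three alloca instructions, and each store instruction can be viewed as a def.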

define i32 @foo(i32 %a, i32 %b) #0 {
entry:
  %a.addr = alloca i32, align 4  ; first alloca instruction, %a.addr
  %b.addr = alloca i32, align 4  ; second alloca instruction, %b.addr
  %r = alloca i32, align 4       ; third alloca instruction, %r
  store i32 %a, i32* %a.addr, align 4 ; definition of %a.addr
  store i32 %b, i32* %b.addr, align 4 ; definition of %b.addr
  %0 = load i32, i32* %a.addr, align 4 ; read of %a.addr
  %1 = load i32, i32* %b.addr, align 4 ; read of %b.addr
  %cmp = icmp sgt i32 %0, %1
  br i1 %cmp, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  %2 = load i32, i32* %a.addr, align 4
  store i32 %2, i32* %r, align 4   ; first definition of %r
  br label %if.end

if.else:                                          ; preds = %entry
  %3 = load i32, i32* %b.addr, align 4
  store i32 %3, i32* %r, align 4   ; second definition of %r
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  %4 = load i32, i32* %r, align 4  ; read of %r
  ret i32 %4
}

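Collecting alloca information

This part mainly gathers information about each alloca instruction, such as which stores and which loads it has, and then filters out the alloca instructions that need no ϕ-instruction at all. When collecting AllocaInfo, the points of interest are: the BasicBlocks containing the store instructions, the BasicBlocks containing the load instructions, and whether they are all in the same BasicBlock.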

// PromoteMemoryToRegister.cpp
struct AllocaInfo {
	// Scan the uses of the specified alloca, filling in the AllocaInfo used 
	// by the rest of the pass to reason about the uses of this alloca.
	void AnalyzeAlloca(AllocaInst *AI) {
		// As we scan the uses of the alloca instruction, keep track of stores,
		// and decide whether all of the loads and stores to the alloca are within
		// the same basic block.
		for (auto UI = AI->user_begin(), E = AI->user_end(); UI != E;) {
			// ...
		}
	}
};

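Each of these cases is handled differently:

  • DefiningBlocks.size() == 1
  • OnlyUsedInOneBlock
  • the general case

DefiningBlocks.size() == 1

%a.addr in the example IR falls into this case. It is handled mainly by rewriteSingleStoreAlloca(). The core of this function is to eliminate the store/load round trip: the value being stored directly replaces every use of the load instructions. The whole point is to reduce the number of ϕ-nodes inserted. The one thing I don't understand is this: with only a single store, can it really fail to dominate all of the loads? Is the IR information incomplete, so that dominance cannot be fully guaranteed?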

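As the figure below shows, after this step all instructions related to %a.addr are deleted, and the value %a that was stored to %a.addr replaces every use of a value loaded from %a.addr.
[Figure: alloca - a]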

// Rewrite as many loads as possible given a single store
// 
// When there is only a single store, we can use the domtree to trivially
// replace all of the dominated loads with the stored value. Do so, and return
// true if this has successfully promoted the alloca entirely. If this returns
// false there were some loads which were not dominated by the single store
// and thus must be phi-ed with undef. We fall back to the standard alloca
// promotion algorithm in that case.
static bool rewriteSingleStoreAlloca(AllocaInst *AI, AllocaInfo &Info,
									LargeBlockInfo &LBI, const DataLayout &DL,
									DominatorTree &DT, AssumptionCache *AC) {
	//... code omitted
}
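
The comment above hints at why a single store is not always enough: a load may sit in a block that the store's block does not dominate, for example when the only assignment is inside an if and the read comes after it. The sketch below checks exactly that condition on an idom array. The names dominates and singleStoreDominatesAllLoads are hypothetical helpers, not LLVM APIs, and the sketch assumes that within the store's own block the store comes before any load.

```cpp
#include <cassert>
#include <vector>

// Returns true if block a dominates block b, walking up the idom chain.
// Convention assumed here: idom[root] == root.
bool dominates(const std::vector<int> &idom, int a, int b) {
  while (b != a) {
    if (idom[b] == b) return false;  // reached the root without meeting a
    b = idom[b];
  }
  return true;
}

// True if every load is in a block dominated by the single store's block.
bool singleStoreDominatesAllLoads(const std::vector<int> &idom,
                                  int storeBlock,
                                  const std::vector<int> &loadBlocks) {
  assert(storeBlock >= 0 && storeBlock < static_cast<int>(idom.size()));
  for (int loadBlock : loadBlocks)
    if (!dominates(idom, storeBlock, loadBlock)) return false;
  return true;
}
```

Take a three-block CFG entry=0, then=1, end=2 with edges 0→1, 0→2, 1→2, so idom = {0, 0, 0}. A store in then and a load in end gives false, because end can be reached without passing through then. For %a.addr in the example, the store is in entry and entry dominates everything, so the check succeeds.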

OnlyUsedInOneBlock

The general case

In the general case, the first step is to compute at the entry of which BasicBlocks the AllocaInst is live.

ComputeLiveInBlocks

One drawback of minimal SSA form is that it may place φ-functions for a variable
at a point in the control-flow graph where the variable was not actually live prior
to SSA. - Static Single Assignment Book

One possible way to do this is to perform liveness analysis prior to SSA construction, and then use the liveness information to suppress the placement of φ-functions as described above; another approach is to construct minimal SSA and then remove the dead φ-functions using dead code elimination. - Static Single Assignment Book

Pruned SSA form: this step filters out the BasicBlocks that do not need a ϕ-instruction inserted, since such a ϕ would be dead anyway.

// Determine which blocks the value is live in.
//
// These are blocks which lead to uses. Knowing this allows us to avoid
// inserting PHI nodes into blocks which don't lead to uses (thus, the
// inserted phi nodes would be dead).
void PromoteMem2Reg::ComputeLiveInBlocks(
	AllocaInst *AI, AllocaInfo &Info,
	const SmallPtrSetImpl<BasicBlock *> &DefBlocks,
	SmallPtrSetImpl<BasicBlock *> &LiveBlocks) {
	// To determine liveness, we must iterate through the predecessors of blocks
	// where the def is live. Blocks are added to the worklist if we need to
	// check their predecessors. Start with all the using blocks.
	SmallVector<BasicBlock *, 64> LiveBlockWorklist(Info.UsingBlocks.begin(),
													Info.UsingBlocks.end());

	// If any of the using blocks is also a definition block, check to see if the
	// definition occurs before or after the use. If it happens before the use,
	// the value isn't really live-in.
	
}
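
Since the listing above is cut short, here is a rough sketch of the backward walk that its comments describe, on integer block ids. computeLiveInBlocks below is my own simplified stand-in, not the LLVM function: it assumes the caller has already dropped from usingBlocks any block where a store precedes the first load, and it takes predecessor lists directly.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy live-in computation for one promoted variable.
// preds[b] lists the CFG predecessors of block b.
std::set<int> computeLiveInBlocks(const std::vector<std::vector<int>> &preds,
                                  const std::vector<int> &usingBlocks,
                                  const std::set<int> &defBlocks) {
  std::vector<int> worklist(usingBlocks.begin(), usingBlocks.end());
  std::set<int> liveIn;
  while (!worklist.empty()) {
    const int bb = worklist.back();
    worklist.pop_back();
    assert(bb >= 0 && bb < static_cast<int>(preds.size()));
    if (!liveIn.insert(bb).second) continue;  // already processed
    for (int p : preds[bb]) {
      // A defining predecessor kills the incoming value, so the variable is
      // not live-in there and the walk stops at it.
      if (defBlocks.count(p)) continue;
      worklist.push_back(p);
    }
  }
  return liveIn;
}
```

For %r in the example (entry=0, if.then=1, if.else=2, if.end=3), the only using block is if.end and both of its predecessors define %r, so the live-in set is just {if.end}.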

Inserting ϕ-nodes

// Calculate iterated dominance frontiers
// 
// This uses the linear-time phi algorithm based on DJ-graphs mentioned in
// the file-level comment. It performs DF->IDF pruning using the live-in
// set, to avoid computing the IDF for blocks where an inserted PHI node
// would be dead.
void calculate(SmallVectorImpl<NodeTy*> &IDFBlocks);
DJ-graphs (Dominator edge, Join edge)

For the details of DJ-graphs, see the paper notes A Linear Time Algorithm for Placing phi-Nodes:閱讀筆記 (reading notes).

With dominance frontiers, the compiler can determine more precisely where ϕ-functions might be needed. The basic idea is simple. A definition of x in block b forces a ϕ-function at every node in DF(b). Since that ϕ-function is a new definition of x, it may, in turn, force the insertion of additional ϕ-functions.

Iterated Dominance Frontier IDF.calculate(PHIBlocks)

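The core of computing the ϕ-nodes is the IDFCalculatorBase class. IDF stands for iterated dominance frontier, and the core algorithm is based on DJ-graphs. In PromoteMem2Reg::run(), for a single alloca instruction, we have already executed IDF.setLiveInBlocks(LiveInBlocks) and IDF.setDefiningBlocks(DefBlocks); the next step is to compute the BasicBlocks in which ϕ-nodes must be inserted, and the heart of that step is IDF.calculate(PHIBlocks).

Using the example code together with the DJ-graph, let me explain the code below.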

template<class NodeTy, bool IsPostDom>
void IDFCalculatorBase<NodeTy, IsPostDom>::calculate(
	SmallVectorImpl<NodeTy *> &PHIBlocks) {
	// Use a priority queue keyed on dominator tree level so that inserted nodes
	// are handled from the bottom of the dominator tree upwards. We also augment
	// the level with a DFS number to ensure that the blocks are ordered in a
	// deterministic way.
	
	IDFPriorityQueue PQ;
	DT.updateDFSNumbers();

	for (NodeTy *BB : *DefBlocks) {
		if (DomTreeNodeBase<NodeTy> *Node = DT.getNode(BB)) {
			PQ.push({Node, std::make_pair(Node->getLevel(), Node->getDFSNumIn())});
		}
	}

	while(!PQ.empty()) {
		DomTreeNodePair RootPair = PQ.top();
		PQ.pop();
		DomTreeNodeBase<NodeTy> *Root = RootPair.first;
		unsigned RootLevel = RootPair.second.first;

		// Walk all dominator tree children of Root, inspecting their CFG edge with
		// target elsewhere on the dominator tree. Only targets whose level is at
		// most Root's level are added to the iterated dominator frontier of the
		// definition set.
		Worklist.clear();
		Worklist.push_back(Root);
		VisitedWorklist.insert(Root);

		while(!Worklist.empty()) {
			DomTreeNodeBase<NodeTy> *Node = Worklist.pop_back_val();
			NodeTy *BB = Node->getBlock();
			// Succ is the successor in the direction we are calculating IDF, so it is
			// successor for IDF, and predecessor for Reverse IDF.
			auto DoWork = [&](NodeTy *Succ) {
				DomTreeNodeBase<NodeTy> *SuccNode = DT.getNode(Succ);

				const unsigned SuccLevel = SuccNode->getLevel();
				if (SuccLevel > RootLevel)
					return;
				
				if (!VisitedPQ.insert(SuccNode).second)
					return;
				
				NodeTy *SuccBB = SuccNode->getBlock();
				if (useLiveIn && !LiveInBlocks->count(SuccBB))
					return;
				
				PHIBlocks.emplace_back(SuccBB);
				if (!DefBlocks->count(SuccBB))
					PQ.push(std::make_pair(
						SuccNode, std::make_pair(SuccLevel, SuccNode->getDFSNumIn())));
			};

			for (auto Succ : ChildrenGetter.get(BB))
				DoWork(Succ);
			
			for (auto DomChild : *Node) {
				if (VisitedWorklist.insert(DomChild).second)
					Worklist.push_back(DomChild);
			}
		}
	}
}

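[Figure: DJ-graph of the example]
Two nodes in the CFG, if.then and if.else, define %r, and the resulting ϕ block is if.end. Note that the original DefiningBlocks does not contain if.end, but since a ϕ-instruction must be inserted in if.end, which is a new def, if.end has to be put into PQ.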
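
To see the level-based walk in isolation, the following is a stripped-down sketch of the same idea on integer block ids. computeIDF is a hypothetical stand-in for IDFCalculatorBase::calculate: it assumes every block is reachable, takes the immediate dominators as an array with idom[root] == root, and orders the priority queue by (level, block id) instead of LLVM's (level, DFS-in number).

```cpp
#include <cassert>
#include <queue>
#include <set>
#include <utility>
#include <vector>

// Toy iterated dominance frontier. succ[v] lists the CFG successors of v.
// liveIn, if given, prunes blocks where the variable is not live-in.
std::vector<int> computeIDF(const std::vector<std::vector<int>> &succ,
                            const std::vector<int> &idom,
                            const std::set<int> &defBlocks,
                            const std::set<int> *liveIn = nullptr) {
  const int n = static_cast<int>(succ.size());
  assert(idom.size() == succ.size());

  // Build dominator-tree children and levels (the root has level 0).
  std::vector<std::vector<int>> domChildren(n);
  int root = -1;
  for (int v = 0; v < n; ++v) {
    if (idom[v] == v) root = v;
    else domChildren[idom[v]].push_back(v);
  }
  assert(root >= 0);
  std::vector<int> level(n, 0);
  std::vector<int> todo(1, root);
  while (!todo.empty()) {
    const int v = todo.back();
    todo.pop_back();
    for (int c : domChildren[v]) {
      level[c] = level[v] + 1;
      todo.push_back(c);
    }
  }

  // Max-heap keyed on level: the deepest definitions are handled first.
  std::priority_queue<std::pair<int, int>> pq;
  for (int b : defBlocks) pq.push(std::make_pair(level[b], b));

  std::set<int> visitedPQ, visitedWorklist;
  std::vector<int> phiBlocks;
  while (!pq.empty()) {
    const int rootLevel = pq.top().first;
    const int rootBlock = pq.top().second;
    pq.pop();

    // Walk the dominator subtree of rootBlock and look at outgoing CFG edges.
    std::vector<int> worklist(1, rootBlock);
    visitedWorklist.insert(rootBlock);
    while (!worklist.empty()) {
      const int node = worklist.back();
      worklist.pop_back();
      for (int s : succ[node]) {
        if (level[s] > rootLevel) continue;         // stays inside the subtree
        if (!visitedPQ.insert(s).second) continue;  // already considered
        if (liveIn && !liveIn->count(s)) continue;  // phi would be dead
        phiBlocks.push_back(s);
        // The phi is a new def, so s must be processed as well.
        if (!defBlocks.count(s)) pq.push(std::make_pair(level[s], s));
      }
      for (int c : domChildren[node])
        if (visitedWorklist.insert(c).second) worklist.push_back(c);
    }
  }
  return phiBlocks;
}
```

With entry=0, if.then=1, if.else=2, if.end=3, successor lists {{1, 2}, {3}, {3}, {}}, idom = {0, 0, 0, 0}, and defining blocks {1, 2}, the result is the single block 3, i.e. if.end.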

PromoteMem2Reg::QueuePhiNode

After computing the blocks that need ϕ-nodes, LLVM creates a new PHINode object and records it in PhiToAllocaMap.

// Queue a phi-node to be added to a basic-block for a specific Alloca.
//
// Returns true if there wasn't already a phi-node for that variable.
bool PromoteMem2Reg::QueuePhiNode(BasicBlock *BB, unsigned AllocaNo,
								unsigned &Version) {
	// ...
}
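
The body is elided above, so here is only a guess at its bookkeeping, written against plain standard containers. The table keyed by (block number, alloca number) mirrors the NewPhiNodes member shown earlier; queuePhiNode, PhiTable, and the use of a string in place of a PHINode are assumptions for illustration. The naming scheme, alloca name plus "." plus a running version, is what I believe produces names like %r.0 in the output.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// (block number, alloca number) -> name of the queued phi.
using PhiTable = std::map<std::pair<unsigned, unsigned>, std::string>;

// Queues a phi for the given alloca in the given block.
// Returns true if there was not already a phi for that variable there.
bool queuePhiNode(PhiTable &newPhiNodes, unsigned bbNumber, unsigned allocaNo,
                  const std::string &allocaName, unsigned &version) {
  assert(!allocaName.empty());
  std::string &slot = newPhiNodes[std::make_pair(bbNumber, allocaNo)];
  if (!slot.empty()) return false;  // a phi is already queued for this pair
  slot = allocaName + "." + std::to_string(version++);
  return true;
}
```

Calling it for block 3 and alloca 2 with the name "r" and a version counter starting at 0 stores "r.0" and bumps the counter to 1; a second call with the same key returns false and leaves the counter alone.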

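After the first part of run() has finished, %a.addr, %b.addr, and %r should be in the state shown below. At this point we have constructed the ϕ-nodes and collected the BasicBlocks into which they are to be inserted.
[Figure: LLVM SSA schematic]
Note: since %a.addr and %b.addr are relatively simple, the red in the figure above indicates that the related instructions have already been fully processed.

The in-memory LLVM IR has not been fully processed yet; the textual IR in the figure above was written by hand and only conveys the rough idea.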

rename

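Once PhiToAllocaMap has been collected, the next step is the rename process. First, be clear that the IR being processed is the in-memory IR: LLVM IR is linked together through users and uses, so in memory it is a graph of pointers pointing back and forth. In 《構造SSA》 my presentation made rename look as if it literally meant renaming, but the core of rename is to connect defs with ϕ-instructions. The "name" is only the surface meaning; a name is a def. In LLVM IR, a def is the value that a store instruction is about to store.

So the key to understanding LLVM's rename is:

  • pick out each store instruction and associate the value it stores with the alloca instruction, so that it can later be plugged into the operands of a ϕ-instruction
  • pick out each load instruction and, depending on the situation, replace it with the value stored by a preceding store instruction or with a ϕ-instruction
  • of course, this must be done in the order in which values flow
  • finally, delete the store and load instructions
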
void PromoteMem2Reg::run() {
	Function &F = *DT.getRoot()->getParent();

	AllocaDbgDeclares.resize(Allocas.size());

	AllocaInfo Info;
	LargeBlockInfo LBI;
	ForwardIDFCalculator IDF(DT);
	// Part 1: place phi nodes
	// ...
	// Part 2: rename pass
	// Walks all basic blocks in the function performing the SSA rename algorithm
	// and inserting the phi nodes we marked as necessary.
	std::vector<RenamePassData> RenamePassWorkList;
	RenamePassWorkList.emplace_back(&F.front(), nullptr, std::move(Values),
									std::move(Locations));
	do {
		RenamePassData PRD = std::move(RenamePassWorkList.back());
		RenamePassWorkList.pop_back();
		// RenamePass may add new worklist entries.
		RenamePass(PRD.BB, PRD.Pred, PRD.Values, PRD.Locations, RenamePassWorkList);
	} while (!RenamePassWorkList.empty());
}

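The code above pre-defines the data associated with the alloca instructions. Only one alloca instruction remains to be processed (the other two have already been handled), so there is only one pre-defined entry. RenamePassWorkList is then initialized with the first BasicBlock of the Function, and control passes to RenamePass(), the core of the whole rename process.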

PromoteMem2Reg::RenamePass()

A central structure in the whole rename pass is IncomingVals, whose type is the ValVector in the structure below.

struct RenamePassData {
	using ValVector = std::vector<Value *>;
	BasicBlock *BB;
	BasicBlock *Pred;
	ValVector Values;
	
};

Through the worklist, IncomingVals plays a role similar to the stack in the rename process described in 《構造SSA》: it stores the def that we should currently use.

Handling store & load instructions

// Recursively traverse the CFG of the function, renaming loads and
// stores to the allocas which we are promoting.
//
// IncomingVals indicates what value each Alloca contains on exit from the
// predecessor block Pred.
void PromoteMem2Reg::RenamePass(BasicBlock *BB, BasicBlock *Pred,
								RenamePassData::ValVector &IncomingVals,
								RenamePassData::LocationVector &IncomingLocs,
								std::vector<RenamePassData> &Worklist) {
NextIteration:
	// If we are inserting any phi nodes into this BB, they will already be in the
	// block.
	// Part 1: fill in the phi nodes
	// Part 2: collect store instructions & alloca instructions
	// Don't revisit blocks
	if (!Visited.insert(BB).second)
		return;
	
	for (BasicBlock::iterator II = BB->begin(); !II->isTerminator();) {
		Instruction *I = &*II++; // get the instruction, increment iterator
		
		if (LoadInst *LI = dyn_cast<LoadInst>(I)) {
			AllocaInst *Src = dyn_cast<AllocaInst>(LI->getPointerOperand());
			if (!Src)
				continue;
			
			DenseMap<AllocaInst *, unsigned>::iterator AI = AllocaLookup.find(Src);
			if (AI == AllocaLookup.end())
				continue;
			
			Value *V = IncomingVals[AI->second];

			// If the load was marked as nonnull we don't want to lose
			// that information when we erase this Load. So we preserve
			// it with an assume.
			// ...

			// Anything using the load now uses the current value.
			LI->replaceAllUsesWith(V);
			BB->getInstList().erase(LI);
		} else if (StoreInst *SI = dyn_cast<StoreInst>(I)) {
			// Delete this instruction and mark the name as the current holder of the
			// value
			AllocaInst *Dest = dyn_cast<AllocaInst>(SI->getPointerOperand());
			if (!Dest)
				continue;
			
			DenseMap<AllocaInst *, unsigned>::iterator ai = AllocaLookup.find(Dest);
			if (ai == AllocaLookup.end())
				continue;
			
			// What value were we writing?
			unsigned AllocaNo = ai->second;
			IncomingVals[AllocaNo] = SI->getOperand(0);

			BB->getInstList().erase(SI);
		}
	}
	// Part 3: update the iteration state
}

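For a load instruction, every use of the load is replaced with the current value of the alloca it reads from, that is, the current def, and the load itself is then deleted.

For a store instruction, the current def is updated to the stored value, and the store is then deleted.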
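
The two rules above can be sketched on a toy block, independent of LLVM. This is only an illustration under stated assumptions: ToyInst and renameBlock are hypothetical names, values are plain strings, and every instruction in the block is assumed to touch a promoted slot.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy instruction: a load from, or a store to, a numbered slot.
struct ToyInst {
	enum Kind { Load, Store } kind;
	unsigned slot;     // index of the promoted "alloca"
	std::string name;  // Load: name of the result; Store: name of the stored value
};

// Walks one block. incoming[slot] is the current def of each slot and is
// updated in place by stores. Returns, for every load, the value that
// replaces it. Loads and stores are removed from the block afterwards.
std::map<std::string, std::string>
renameBlock(std::vector<ToyInst> &block, std::vector<std::string> &incoming) {
	std::map<std::string, std::string> replaced;
	for (const ToyInst &inst : block) {
		assert(inst.slot < incoming.size() && "unknown slot");
		if (inst.kind == ToyInst::Load)
			replaced[inst.name] = incoming[inst.slot]; // uses of the load now use the current def
		else
			incoming[inst.slot] = inst.name;           // the stored value becomes the current def
	}
	block.clear(); // in this toy, every instruction touched a promoted slot
	return replaced;
}
```

A load that precedes any store in the block sees whatever def came in from the predecessor, while a load after a store sees the stored value.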

Filling in the ϕ-nodes

void PromoteMem2Reg::RenamePass(BasicBlock *BB, BasicBlock *Pred,
								RenamePassData::ValVector &IncomingVals,
								RenamePassData::LocationVector &IncomingLocs,
								std::vector<RenamePassData> &Worklist) {
NextIteration:
	// If we are inserting any phi nodes into this BB, they will already be in the
	// block.
	// Part 1: fill in the phi nodes
	if (PHINode *APN = dyn_cast<PHINode>(BB->begin())) {
		// If we have PHI nodes to update, compute the number of edges from Pred to
		// BB.
		if (PhiToAllocaMap.count(APN)) {
			// We want to be able to distinguish between PHI nodes being inserted by
			// this invocation of mem2reg from those phi nodes that already existed in
			// the IR before mem2reg was run. We determine that APN is being inserted
			// because it is missing incoming edges. All other PHI nodes being
			// inserted by this pass of mem2reg will have the same number of incoming
			// operands so far. Remember this count.
			unsigned NewPHINumOperands = APN->getNumOperands();

			unsigned NumEdges = std::count(succ_begin(Pred), succ_end(Pred), BB);

			// Add entries for all the phis.
			BasicBlock::iterator PNI = BB->begin();
			do {
				unsigned AllocaNo = PhiToAllocaMap[APN];
				
				// Update the location of the phi node.
				updateForIncomingValueLocation(APN, IncomingLocs[AllocaNo],
											APN->getNumIncomingValues() > 0);

				// Add N incoming values to the PHI node.
				for (unsigned i = 0; i != NumEdges; ++i) 
					APN->addIncoming(IncomingVals[AllocaNo], Pred);
				
				// The currently active variable for this block is now the PHI.
				IncomingVals[AllocaNo] = APN;

				// Get the next phi node.
				++PNI;
				APN = dyn_cast<PHINode>(PNI);
				if (!APN)
					break;
			} while(APN->getNumOperands() == NewPHINumOperands);
		}
	}
	// Part 2: handle the load and store instructions on the promoted allocas
	// Part 3: update the iteration state
}

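When the walk reaches a ϕ-node, it must have arrived by iterating down from a predecessor. The IncomingVals array holds the def passed along from that predecessor, and this def is added to the ϕ-node as one operand in the form of a <def, pred> pair. The do{}while() form exists because a block usually contains several ϕ-nodes, although in our example there is only one.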
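
A minimal sketch of this step, again on toy data rather than on LLVM's classes. ToyPhi, fillPhis and the name r.0 are hypothetical; the successor list of the predecessor is passed in explicitly so that the edge count mirrors the NumEdges computation in the listing.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy phi node: a result name plus (value, predecessor) pairs.
struct ToyPhi {
	std::string name;
	unsigned slot; // which promoted slot this phi merges
	std::vector<std::pair<std::string, std::string>> incoming;
};

// Called when the walk enters block `bb` from `pred`. `succsOfPred` lists the
// successors of `pred`, duplicates included, so a block reached over two
// edges from the same predecessor receives two operands.
void fillPhis(std::vector<ToyPhi> &phis, const std::string &bb,
              const std::string &pred,
              const std::vector<std::string> &succsOfPred,
              std::vector<std::string> &incomingVals) {
	auto numEdges = std::count(succsOfPred.begin(), succsOfPred.end(), bb);
	for (ToyPhi &phi : phis) {
		assert(phi.slot < incomingVals.size() && "unknown slot");
		for (auto i = numEdges; i > 0; --i)
			phi.incoming.emplace_back(incomingVals[phi.slot], pred);
		// From here on, the phi itself is the current def of the slot.
		incomingVals[phi.slot] = phi.name;
	}
}
```

Calling fillPhis once per incoming edge, first from if.then with def a and then from if.else with def b, leaves the toy phi with the two pairs <a, if.then> and <b, if.else>.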

The overall iteration

// Recursively traverse the CFG of the function, renaming loads and
// stores to the allocas which we are promoting.
//
// IncomingVals indicates what value each Alloca contains on exit from the
// predecessor block Pred.
void PromoteMem2Reg::RenamePass(BasicBlock *BB, BasicBlock *Pred,
								RenamePassData::ValVector &IncomingVals,
								RenamePassData::LocationVector &IncomingLocs,
								std::vector<RenamePassData> &Worklist) {
NextIteration:
	// If we are inserting any phi nodes into this BB, they will already be in the
	// block.
	// Part 1: fill in the phi nodes
	// Part 2: handle the load and store instructions on the promoted allocas
	// Part 3: update the iteration state

	// 'Recurse' to our successors.
	succ_iterator I = succ_begin(BB), E = succ_end(BB);
	if (I == E)
		return;
	
	// Keep track of the successors so we don't visit the same successor twice
	SmallPtrSet<BasicBlock *, 8> VisitedSuccs;

	// Handle the first successor without using the worklist.
	VisitedSuccs.insert(*I);
	Pred = BB;
	BB = *I;
	for (; I != E; ++I)
		if (VisitedSuccs.insert(*I).second)
			Worklist.emplace_back(*I, Pred, IncomingVals, IncomingLocs);
	
	goto NextIteration;
}

Above RenamePass() there is also a do{}while() loop that drains the entries in Worklist. For our example code, the whole process is shown in the figure below:
Final result
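
The traversal for the test.c example can be replayed with a small standalone sketch. It is a simplification under stated assumptions: ToyBlock, WorkItem, renameAll and the phi.<block> naming are hypothetical, only one variable (r) is tracked, parallel edges are ignored, and every successor goes through the worklist, whereas the real code continues into the first successor with goto NextIteration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy CFG node for a single promoted variable.
struct ToyBlock {
	std::vector<std::string> succs;
	std::string store; // value stored in this block, empty if none
	bool hasPhi;       // true if phi placement put a phi here
	std::vector<std::pair<std::string, std::string>> phiIncoming; // (value, pred)
};

struct WorkItem {
	std::string bb, pred, incoming; // `incoming` is this entry's own copy of the def
};

// Returns the def of the variable at the end of every reachable block.
std::map<std::string, std::string>
renameAll(std::map<std::string, ToyBlock> &cfg, const std::string &entry) {
	assert(cfg.count(entry) && "entry block must exist");
	std::map<std::string, std::string> outVal;
	std::set<std::string> visited;
	std::vector<WorkItem> worklist;
	worklist.push_back(WorkItem{entry, "", "undef"});
	while (!worklist.empty()) {
		WorkItem item = std::move(worklist.back());
		worklist.pop_back();
		ToyBlock &block = cfg.at(item.bb);
		// Part 1: a phi gets an operand on every arrival, even if the
		// block itself was already visited.
		if (block.hasPhi) {
			block.phiIncoming.emplace_back(item.incoming, item.pred);
			item.incoming = "phi." + item.bb;
		}
		if (!visited.insert(item.bb).second)
			continue; // don't revisit blocks
		// Part 2: a store replaces the current def.
		if (!block.store.empty())
			item.incoming = block.store;
		outVal[item.bb] = item.incoming;
		// Part 3: queue each distinct successor with a copy of the def.
		// Pushing in reverse makes the first successor the next one popped.
		std::set<std::string> seen;
		for (auto it = block.succs.rbegin(); it != block.succs.rend(); ++it)
			if (seen.insert(*it).second)
				worklist.push_back(WorkItem{*it, item.bb, item.incoming});
	}
	return outVal;
}
```

Because each worklist entry carries its own copy of the def, the store in if.then cannot leak into if.else; the two values only meet again in the phi of if.end.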

Cleanup

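The final cleanup is straightforward and consists of the following steps:

  • Delete the alloca instructions
  • Remove ϕ-nodes that merge identical incoming values
  • Complete ϕ-nodes that lack incoming values because some predecessors are unreachable basic blocks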
void PromoteMem2Reg::run() {
	// Cleanup
	// Remove the allocas themselves from the function
	for (Instruction *A : Allocas) {
		// If there are any uses of the alloca instructions left, they must be in
		// unreachable basic blocks that were not processed by walking the dominator
		// tree. Just delete the users now.
		if (!A->use_empty())
			A->replaceAllUsesWith(UndefValue::get(A->getType()));
		A->eraseFromParent();
	}

	// Loop over all of the PHI nodes and see if there are any that we can get
	// rid of because they merge all of the same incoming values. This can
	// happen due to undef values coming into the PHI nodes. This process is
	// iterative, because eliminating one PHI node can cause others to be removed.
	// ...

	// At this point, the renamer has added entries to PHI nodes for all reachable
	// code. Unfortunately, there may be unreachable blocks which the renamer
	// hasn't traversed. If this is the case, the PHI nodes may not 
	// have incoming values for all predecessors. Look over all PHI nodes we have
	// created, inserting undef values if they are missing any incoming values.
	// ...
}
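
The second cleanup step, removing a ϕ-node whose incoming values all agree, can be illustrated with a toy check. This is a sketch under assumptions, not what the pass literally runs: commonIncomingValue is a hypothetical helper, values are strings, and "undef" is treated as compatible with any value, whereas the elided LLVM code relies on its general instruction simplification, which applies further conditions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the single value a toy phi could be replaced with, or "" when the
// incoming values genuinely differ and the phi has to stay.
std::string commonIncomingValue(const std::vector<std::string> &incoming) {
	std::string common;
	for (const std::string &v : incoming) {
		assert(!v.empty() && "toy values must be named; \"\" is the sentinel");
		if (v == "undef")
			continue;        // assumption: undef agrees with anything
		if (common.empty())
			common = v;      // first concrete value seen
		else if (common != v)
			return "";       // two different concrete values
	}
	return common.empty() ? "undef" : common;
}
```

The process is iterative in the real pass because replacing one ϕ-node with its common value can make the operands of another ϕ-node identical as well.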

At this point the whole process is complete, and the pass sets its status variable LocalChanged = true. Because we ran the command opt -mem2reg test.ll -o test.bc, a BitcodeWriterPass follows afterwards.
