Lua C API Call Performance Test

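Some small projects I have been working on recently mix Lua with the C API. There are two designs for event handling. In the first, the C layer receives messages through a message queue and calls the corresponding Lua function according to the message type, exposing a callback registration method such as AddListener to the Lua layer. In the second, the message queue methods themselves, such as PushEvent and GetEvent, are exposed directly to the Lua layer, and the event handling code is written in Lua. I started with the first design, but later noticed some stuttering as the message volume grew, so I wondered whether the design itself carried a performance penalty and used the code below to check: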

int test_function(lua_State* L)
{
	int a = lua_tointeger(L, 1);
	int b = lua_tointeger(L, 2);
	int c = a + b;
	lua_pushinteger(L, c);
	return 1;
}

int test_in_c(int a, int b)
{
	return a + b;
}

int benchmark()
{
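	// LuaVM is the author's own wrapper (definition not shown); it is assumed to
	// convert to lua_State* and to have the standard libraries already opened.
	LuaVM L;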
	lua_pushcfunction(L, test_function);
	lua_setglobal(L, "ctestfn");

	luaL_loadstring(L, "for i=1, 10000000 do ctestfn(1, 2) end");

	clock_t before = clock();
	lua_pcall(L, 0, 0, 0); // msgh = 0: no message handler (-1 would refer to the chunk itself)
	clock_t after = clock();

	cout << "Loop in Lua, call into C: " << ((double)after - before)/CLOCKS_PER_SEC << "s" << endl;

	luaL_loadstring(L, "for i=1, 10000000 do ctestfn(1, 2) end");

	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call into C (unprotected): " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	luaL_loadstring(L, "for i=1, 10000000 do pcall(ctestfn, 1, 2) end");

	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call into C: (pcall) " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	luaL_loadstring(L, "for i=1, 10000000 do xpcall(ctestfn, function() print(debug.traceback()) end, 1, 2) end");

	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call into C: (xpcall) " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	// Lua, Lua

	luaL_loadstring(L, "function testfn(a,b) return a+b end for i=1, 10000000 do testfn(1, 2) end");

	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call in Lua: " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	luaL_loadstring(L, "for i=1, 10000000 do pcall(testfn, 1, 2) end");
	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call in Lua (with pcall): " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	luaL_loadstring(L, "for i=1, 10000000 do xpcall(testfn, function() print(debug.traceback()) end, 1, 2) end");
	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call in Lua (with xpcall): " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	luaL_loadstring(L, "x=coroutine.create(function(a,b) while true do a,b=coroutine.yield(a+b) end end) for i=1, 10000000 do coroutine.resume(x, 1, 2) end");
	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call in Lua (with coroutine): " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	before = clock();
	for (int i = 0; i < 10000000; i++)
	{
		test_in_c(1, 2);
	}
	after = clock();
	
	cout << "Loop in C, call in C: " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	lua_getglobal(L, "testfn");
	before = clock();
	for (int i = 0; i < 10000000; i++)
	{
		lua_pushvalue(L, -1);
		lua_pushinteger(L, 1);
		lua_pushinteger(L, 2);
		lua_call(L, 2, 0);
	}
	after = clock();
	lua_pop(L, 1);

	cout << "Loop in C, call into Lua: " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	lua_getglobal(L, "ctestfn");
	before = clock();
	for (int i = 0; i < 10000000; i++)
	{
		lua_pushvalue(L, -1);
		lua_pushinteger(L, 1);
		lua_pushinteger(L, 2);
		lua_call(L, 2, 0);
	}
	after = clock();
	lua_pop(L, 1);

	cout << "Loop in C, call into Lua, then call into C: " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	return 0;
}

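The test itself is simple: a function takes two arguments, a and b, and returns a+b. Metamethods and Lua's automatic string-to-number coercion are deliberately left out, so the test measures nothing but the cost of the a+b call.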

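The test covers the following ways of making the call.

Loop written in Lua: calling the C function directly, through pcall, and through xpcall; calling the Lua function directly, through pcall, and through xpcall; and calling the Lua function via coroutine.resume/coroutine.yield. (Yielding from inside a C function was not tested: when a C function yields, the next resume effectively enters the continuation function passed to lua_yieldk, so it did not seem worth measuring.)

Loop written in C: calling the C function directly, calling the Lua function, and calling the C function through Lua (lua_call on the registered ctestfn).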

Each loop runs ten million iterations. The results are as follows:

Compiled with Visual Studio 2019 in Debug mode:

Loop in Lua, call into C: 3.146s
Loop in Lua, call into C (unprotected): 3.123s
Loop in Lua, call into C: (pcall) 8.4s
Loop in Lua, call into C: (xpcall) 9.562s
Loop in Lua, call in Lua: 1.84s
Loop in Lua, call in Lua (with pcall): 8.417s
Loop in Lua, call in Lua (with xpcall): 9.348s
Loop in Lua, call in Lua (with coroutine): 12.166s
Loop in C, call in C: 0.138s
Loop in C, call into Lua: 3.964s
Loop in C, call into Lua, then call into C: 3.965s

Compiled with Visual Studio 2019 in Release mode:

Loop in Lua, call into C: 0.423s
Loop in Lua, call into C (unprotected): 0.372s
Loop in Lua, call into C: (pcall) 0.803s
Loop in Lua, call into C: (xpcall) 0.929s
Loop in Lua, call in Lua: 0.489s
Loop in Lua, call in Lua (with pcall): 0.966s
Loop in Lua, call in Lua (with xpcall): 1.086s
Loop in Lua, call in Lua (with coroutine): 1.942s
Loop in C, call in C: 0s
Loop in C, call into Lua: 0.261s
Loop in C, call into Lua, then call into C: 0.194s

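As the results show, there is still a sizeable gap between native C (0.138s/0s) and native Lua (1.84s/0.489s). The 0s for the C function may mean that the compiler optimized the loop away, since the return value of test_in_c is never used, although it is also possible that the time really is too short to register.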

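For cross-language calls in Debug mode, Lua calling C (3.146s) is somewhat faster than C calling Lua (3.964s). pcall and xpcall are much slower because of the extra protected-mode work, and the coroutine case is slower still: besides the protection, it has to save the execution stack on yield and restore it on resume. The Release numbers are harder to explain. My guess is that the initial Lua-calls-C figure of 0.423s is higher than the later C-calls-Lua figure of 0.261s because of program warm-up. The fact that C calling Lua and then C (0.194s) takes less time than plain C calling Lua (0.261s) may likewise come down to the Lua VM warming up. These are only guesses, though; I have not found a convincing explanation yet.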

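After these tests, I have decided to switch to the second design for now: control of the event queue goes to the Lua layer, with a library written in Lua wrapping it for user code. This way the C layer does not need to handle much Lua-related logic; it only has to push each message onto the Lua stack in the agreed format and return. User code also cannot corrupt the event queue directly. Another benefit is that the Lua layer gains more room to manoeuvre: once it owns the event queue, it can, for example, schedule and run suspended coroutines during idle time when no events have arrived.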
