Lua C API Call Performance Test

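I recently used Lua together with the C API in a few small projects of my own. There were two possible designs for event handling. In the first, the C layer receives messages through a message queue, calls the matching Lua function for each message type, and exposes a callback-registration method such as AddListener to the Lua layer. In the second, the message-queue functions themselves (for example PushEvent and GetEvent) are exposed to the Lua layer, and the event-handling code is written in Lua. I started with the first design, but noticed some stuttering once the message volume grew, so I wondered whether the design itself carried a performance penalty. The code below is what I used to check: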

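// Assumed headers; the original listing omits them.
#include <ctime>
#include <iostream>
#include <lua.hpp>
using namespace std;

// C function exposed to Lua as "ctestfn": returns the sum of its two integer arguments.
int test_function(lua_State* L)
{
	lua_Integer a = lua_tointeger(L, 1);
	lua_Integer b = lua_tointeger(L, 2);
	lua_Integer c = a + b;
	lua_pushinteger(L, c);
	return 1;
}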

int test_in_c(int a, int b)
{
	return a + b;
}

int benchmark()
{
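	// LuaVM is the author's own wrapper (not shown); assumed to convert implicitly
	// to lua_State* and to have the standard libraries already opened.
	LuaVM L;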
	lua_pushcfunction(L, test_function);
	lua_setglobal(L, "ctestfn");

	luaL_loadstring(L, "for i=1, 10000000 do ctestfn(1, 2) end");

	clock_t before = clock();
	lua_pcall(L, 0, 0, 0); // msgh = 0: no message handler (-1 would point at the chunk itself)
	clock_t after = clock();

	cout << "Loop in Lua, call into C: " << ((double)after - before)/CLOCKS_PER_SEC << "s" << endl;

	luaL_loadstring(L, "for i=1, 10000000 do ctestfn(1, 2) end");

	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call into C (unprotected): " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	luaL_loadstring(L, "for i=1, 10000000 do pcall(ctestfn, 1, 2) end");

	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call into C: (pcall) " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	luaL_loadstring(L, "for i=1, 10000000 do xpcall(ctestfn, function() print(debug.traceback()) end, 1, 2) end");

	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call into C: (xpcall) " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	// Lua, Lua

	luaL_loadstring(L, "function testfn(a,b) return a+b end for i=1, 10000000 do testfn(1, 2) end");

	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call in Lua: " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	luaL_loadstring(L, "for i=1, 10000000 do pcall(testfn, 1, 2) end");
	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call in Lua (with pcall): " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	luaL_loadstring(L, "for i=1, 10000000 do xpcall(testfn, function() print(debug.traceback()) end, 1, 2) end");
	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call in Lua (with xpcall): " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	luaL_loadstring(L, "x=coroutine.create(function(a,b) while true do a,b=coroutine.yield(a+b) end end) for i=1, 10000000 do coroutine.resume(x, 1, 2) end");
	before = clock();
	lua_call(L, 0, 0);
	after = clock();

	cout << "Loop in Lua, call in Lua (with coroutine): " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	before = clock();
	for (int i = 0; i < 10000000; i++)
	{
		test_in_c(1, 2);
	}
	after = clock();
	
	cout << "Loop in C, call in C: " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	lua_getglobal(L, "testfn");
	before = clock();
	for (int i = 0; i < 10000000; i++)
	{
		lua_pushvalue(L, -1);
		lua_pushinteger(L, 1);
		lua_pushinteger(L, 2);
		lua_call(L, 2, 0);
	}
	after = clock();
	lua_pop(L, 1);

	cout << "Loop in C, call into Lua: " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	lua_getglobal(L, "ctestfn");
	before = clock();
	for (int i = 0; i < 10000000; i++)
	{
		lua_pushvalue(L, -1);
		lua_pushinteger(L, 1);
		lua_pushinteger(L, 2);
		lua_call(L, 2, 0);
	}
	after = clock();
	lua_pop(L, 1);

	cout << "Loop in C, call into Lua, then call into C: " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

	return 0;
}

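The test is simple: a function takes two arguments a and b and returns a+b. Metamethods and Lua's automatic string-to-number coercion are left out of the picture; the only thing measured is the cost of calling a+b.

The test covers the following call paths:

Loop written in Lua: calling a C function, calling a C function through pcall, calling a C function through xpcall, calling a Lua function, calling a Lua function through pcall, calling a Lua function through xpcall, and calling a Lua function through coroutine.resume/coroutine.yield. (When a C function called from Lua yields, the next resume actually runs the continuation function passed to lua_yieldk, so that case did not seem worth testing.)

Loop written in C: calling a C function, calling a Lua function, and calling a C function through Lua.

Each loop runs ten million iterations. The results: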

Compiled with Visual Studio 2019 in Debug mode:

Loop in Lua, call into C: 3.146s
Loop in Lua, call into C (unprotected): 3.123s
Loop in Lua, call into C: (pcall) 8.4s
Loop in Lua, call into C: (xpcall) 9.562s
Loop in Lua, call in Lua: 1.84s
Loop in Lua, call in Lua (with pcall): 8.417s
Loop in Lua, call in Lua (with xpcall): 9.348s
Loop in Lua, call in Lua (with coroutine): 12.166s
Loop in C, call in C: 0.138s
Loop in C, call into Lua: 3.964s
Loop in C, call into Lua, then call into C: 3.965s

Compiled with Visual Studio 2019 in Release mode:

Loop in Lua, call into C: 0.423s
Loop in Lua, call into C (unprotected): 0.372s
Loop in Lua, call into C: (pcall) 0.803s
Loop in Lua, call into C: (xpcall) 0.929s
Loop in Lua, call in Lua: 0.489s
Loop in Lua, call in Lua (with pcall): 0.966s
Loop in Lua, call in Lua (with xpcall): 1.086s
Loop in Lua, call in Lua (with coroutine): 1.942s
Loop in C, call in C: 0s
Loop in C, call into Lua: 0.261s
Loop in C, call into Lua, then call into C: 0.194s

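Native C (0.138s/0s) and native Lua (1.84s/0.489s) are clearly far apart. The 0s for the C loop may mean the compiler optimized the loop away, though it is also possible that the loop simply finished too quickly to register.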
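One way to tell those two possibilities apart is to make the result of every call observable, so the compiler cannot drop the loop. The following is only a rough sketch under my own assumptions: timed_c_loop and LoopResult are hypothetical names that are not part of the benchmark above, and std::chrono::steady_clock is swapped in for clock().

```cpp
#include <cassert>
#include <chrono>

// Same trivial function as in the benchmark.
static int test_in_c(int a, int b) { return a + b; }

// Hypothetical result type: the accumulated sum plus the elapsed time.
struct LoopResult {
	long long sum;   // sum of all return values
	double seconds;  // elapsed time measured with steady_clock
};

// Hypothetical helper: runs the C loop and stores every result in a
// volatile variable, so the additions cannot be removed as dead code.
LoopResult timed_c_loop(int iterations)
{
	assert(iterations >= 0);
	volatile long long sink = 0;  // volatile: every read and write must happen
	auto start = std::chrono::steady_clock::now();
	for (int i = 0; i < iterations; ++i)
	{
		sink = sink + test_in_c(1, 2);
	}
	auto stop = std::chrono::steady_clock::now();
	std::chrono::duration<double> elapsed = stop - start;
	return LoopResult{sink, elapsed.count()};
}
```

Calling timed_c_loop(10000000) and printing seconds should then give a non-zero figure in Release mode if the original 0s came from the loop being removed. Note that the volatile accesses add their own cost and the call itself may still be inlined, so the figure is an upper bound for the loop rather than the price of a function call.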

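For cross-language calls in Debug mode, Lua calling C is somewhat faster than C calling Lua (3.146s versus 3.964s). pcall and xpcall are much slower because of the extra work a protected call involves, and coroutines are slower still: besides the protection, they also have to save the execution stack on yield and restore it on resume. The Release numbers are harder to explain. My guess is that the initial Lua-to-C case (0.423s) takes longer than the later C-to-Lua case (0.261s) because of program warm-up. Likewise, C calling Lua calling C (0.194s) coming in faster than plain C calling Lua (0.261s) may be down to the Lua VM having warmed up. These are only guesses, though; I have not found a convincing explanation yet.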
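The warm-up guess could be checked by giving every case an untimed warm-up run and then timing it several times instead of once. Below is a minimal sketch of such a harness; measure_runs and median_seconds are hypothetical helpers of my own, not part of the original benchmark, and the body passed in stands for any one of the test cases.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: runs `body` once untimed as a warm-up, then `runs`
// timed repetitions, and returns the duration of each repetition in seconds.
std::vector<double> measure_runs(const std::function<void()>& body, int runs)
{
	assert(runs > 0);
	body();  // warm-up run, not recorded
	std::vector<double> samples;
	samples.reserve(static_cast<std::size_t>(runs));
	for (int r = 0; r < runs; ++r)
	{
		auto start = std::chrono::steady_clock::now();
		body();
		auto stop = std::chrono::steady_clock::now();
		samples.push_back(std::chrono::duration<double>(stop - start).count());
	}
	return samples;
}

// Median of the samples (the upper median when the count is even).
double median_seconds(std::vector<double> samples)
{
	assert(!samples.empty());
	std::sort(samples.begin(), samples.end());
	return samples[samples.size() / 2];
}
```

If the ordering of the Release results survives once each case is warmed up and reported as a median over several runs, warm-up is probably not the explanation and the difference would have to come from the call paths themselves.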

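After these tests I decided to move to the second design for now: the Lua layer controls the event queue, with a library written in Lua wrapping it for user code. That way the C layer has little Lua-specific work to do; it only has to push each message onto the Lua stack in the agreed format and return, and user code cannot corrupt the event queue directly. Another benefit is that the Lua layer gains more room to maneuver: since it owns the event queue, it can, for example, schedule and run suspended coroutines during idle time when no events have arrived.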
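To make the shape of the second design a little more concrete, here is a rough sketch of the queue on the C/C++ side. It is an illustration under my own assumptions, not the project's actual code: only the names PushEvent and GetEvent come from the description above, while the Event fields, the class layout and the single-threaded use are hypothetical. The Lua binding is left out; in the real design GetEvent would be wrapped in a C function that pushes the event's fields onto the Lua stack.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>

// Hypothetical event record; the real message format is not described here.
struct Event {
	int type;
	std::string payload;
};

// Single-threaded sketch of the queue that the Lua layer would poll.
class EventQueue {
public:
	// Appends an event to the back of the queue.
	void PushEvent(Event e) { events_.push_back(std::move(e)); }

	// Moves the oldest pending event into `out` and returns true,
	// or returns false and leaves `out` untouched when the queue is empty.
	bool GetEvent(Event& out)
	{
		if (events_.empty()) return false;
		out = std::move(events_.front());
		events_.pop_front();
		return true;
	}

private:
	std::deque<Event> events_;
};
```

With this shape the Lua side decides what to do when GetEvent reports that nothing is pending, which is where resuming suspended coroutines during idle time would fit in.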

 
