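NumPy Optimization Methods in Cython
Official reference: http://docs.cython.org/en/latest/src/userguide/numpy_tutorial.html#numpy-tutorial
Code on GitHub: https://github.com/chenyangMl/cython-pro/tree/master/Numpy_cython
Cython supports all regular NumPy operations: Python code copied into a .pyx file compiles without changes. Doing only that, however, gains none of the NumPy optimizations Cython offers. Cython can currently speed up NumPy code with the following technique.
Typed memoryviews
Example: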
-
1 The original compute_cy.pyx file
    import numpy as np


    def clip(a, min_value, max_value):
        return min(max(a, min_value), max_value)


    def compute(array_1, array_2, a, b, c):
        """
        Computes the equivalent of:
        np.clip(array_1, 2, 10) * a + array_2 * b + c

        array_1 and array_2 are 2D.
        """
        x_max = array_1.shape[0]
        y_max = array_1.shape[1]

        assert array_1.shape == array_2.shape

        result = np.zeros((x_max, y_max), dtype=array_1.dtype)

        for x in range(x_max):
            for y in range(y_max):
                tmp = clip(array_1[x, y], 2, 10)
                tmp = tmp * a + array_2[x, y] * b
                result[x, y] = tmp + c

        return result
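The docstring names the NumPy expression that compute reproduces element by element. As a sanity check before optimizing, a minimal sketch like the following compares a pure-Python copy of the loop against that expression on a small array. The names `compute_py` and `compute_reference` are chosen here for illustration and are not part of the original repository.

```python
import numpy as np


def clip(a, min_value, max_value):
    return min(max(a, min_value), max_value)


def compute_py(array_1, array_2, a, b, c):
    """Pure-Python copy of the loop in compute_cy.pyx (illustrative name)."""
    assert array_1.shape == array_2.shape
    x_max, y_max = array_1.shape
    result = np.zeros((x_max, y_max), dtype=array_1.dtype)
    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            result[x, y] = tmp * a + array_2[x, y] * b + c
    return result


def compute_reference(array_1, array_2, a, b, c):
    """Vectorized NumPy expression from the docstring (illustrative name)."""
    return np.clip(array_1, 2, 10) * a + array_2 * b + c


# Small arrays keep the pure-Python loop fast enough for a quick check.
rng = np.random.default_rng(0)
array_1 = rng.integers(0, 1000, size=(30, 20)).astype(np.intc)
array_2 = rng.integers(0, 1000, size=(30, 20)).astype(np.intc)

expected = compute_reference(array_1, array_2, 4, 3, 9)
actual = compute_py(array_1, array_2, 4, 3, 9)
assert np.array_equal(actual, expected)
```

The same comparison can be pointed at any compiled variant of compute to confirm that an optimization did not change the result.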
-
2 Write the profiling script cython_profile.py
    import cProfile
    import pstats

    import numpy as np

    array_1 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)
    array_2 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)
    a = 4
    b = 3
    c = 9

    import compute_cy

    # Profile one call to compute() and save the raw statistics to Profile.prof.
    cProfile.runctx("compute_cy.compute(array_1,array_2,a,b,c)", globals(),
                    locals={"array_1": array_1, "array_2": array_2, "a": a, "b": b, "c": c},
                    filename="Profile.prof")

    # Load the saved statistics and print them sorted by internal time.
    s = pstats.Stats("Profile.prof")
    s.strip_dirs().sort_stats("time").print_stats()
    6000005 function calls in 26.935 seconds

    Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1   13.955   13.955   26.934   26.934 compute_cy.pyx:7(compute)
      6000000   12.979    0.000   12.979    0.000 compute_cy.pyx:4(clip)
            1    0.001    0.001   26.935   26.935 <string>:1(<module>)
            1    0.000    0.000   26.935   26.935 {built-in method builtins.exec}
            1    0.000    0.000   26.934   26.934 {compute_cy.compute}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
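Field                      Notes
ncalls                     Number of times the function was called
tottime                    Total time spent inside the function itself, excluding calls to sub-functions
percall                    Average time per call: tottime / ncalls (first column), cumtime / primitive calls (second column)
cumtime                    Cumulative time spent in the function and in all the sub-functions it calls
filename:lineno(function)  File name, line number, and name of the profiled function

The results show that, of the 26.935 seconds total, compute itself accounts for 13.955 seconds and the 6,000,000 calls to clip account for 12.979 seconds.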
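To make these columns concrete, the sketch below profiles a tiny pure-Python stand-in (the functions `clip` and `work` here are illustrative, not the compiled module) and reads the counters back from `pstats.Stats`. It relies on the `stats` attribute of `pstats.Stats`, a dictionary whose values begin with (primitive calls, total calls, tottime, cumtime). That attribute is widely used but not formally documented, so treat its layout as an assumption.

```python
import cProfile
import pstats


def clip(a, min_value, max_value):
    return min(max(a, min_value), max_value)


def work(n):
    """Call clip() n times so the profiler has something to count."""
    total = 0
    for i in range(n):
        total += clip(i, 2, 10)
    return total


profiler = cProfile.Profile()
profiler.enable()
work(1000)
profiler.disable()

stats = pstats.Stats(profiler)

# Map function name -> (ncalls, tottime, cumtime) for the two functions above.
# Assumption: each value in stats.stats is
# (primitive calls, total calls, tottime, cumtime, callers).
summary = {}
for (filename, lineno, name), (prim_calls, ncalls, tottime, cumtime, callers) in stats.stats.items():
    if name in ("clip", "work"):
        summary[name] = (ncalls, tottime, cumtime)

for name, (ncalls, tottime, cumtime) in summary.items():
    print(f"{name}: ncalls={ncalls} tottime={tottime:.6f} cumtime={cumtime:.6f}")
```

Here work is called once and clip 1000 times, and the cumtime of work includes the time spent in every clip call, which mirrors the relationship between compute and clip in the output above.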
-
3 Optimizing by declaring data types
    # touch compute_typed.pyx
    import numpy as np

    DTYPE = np.intc


    # cdef means here that this function is a plain C function (so faster).
    # To get all the benefits, we type the arguments and the return value.
    cdef int clip(int a, int min_value, int max_value):
        return min(max(a, min_value), max_value)


    def compute(array_1, array_2, int a, int b, int c):

        cdef int x_max = array_1.shape[0]
        cdef int y_max = array_1.shape[1]

        assert array_1.shape == array_2.shape
        assert array_1.dtype == DTYPE
        assert array_2.dtype == DTYPE

        result = np.zeros((x_max, y_max), dtype=DTYPE)

        cdef int tmp

        # Py_ssize_t is the proper C type for Python array indices.
        cdef Py_ssize_t x, y

        for x in range(x_max):
            for y in range(y_max):
                tmp = clip(array_1[x, y], 2, 10)
                tmp = tmp * a + array_2[x, y] * b
                result[x, y] = tmp + c

        return result
Edit the profiling script cython_profile.py, replacing the import and the cProfile.runctx call with the following, then profile again:
    import compute_typed

    cProfile.runctx("compute_typed.compute(array_1,array_2,a,b,c)", globals(),
                    locals={"array_1": array_1, "array_2": array_2, "a": a, "b": b, "c": c},
                    filename="Profile.prof")
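Editing the script for every variant gets repetitive. One possible alternative, sketched here with a hypothetical helper `profile_compute` that is not part of the original repository, is to pass the module name as a string and import it with `importlib.import_module`. So that the snippet runs without a compiled extension, it registers a pure-Python stand-in module named `compute_stub` (also hypothetical); with the real build you would pass "compute_cy" or "compute_typed" instead. Unlike the script above, this sketch keeps the statistics in memory rather than writing Profile.prof.

```python
import cProfile
import importlib
import pstats
import sys
import types

import numpy as np


def profile_compute(module_name, array_1, array_2, a, b, c):
    """Profile <module_name>.compute(...) and return a pstats.Stats object."""
    module = importlib.import_module(module_name)
    profiler = cProfile.Profile()
    profiler.runctx(
        "module.compute(array_1, array_2, a, b, c)",
        globals(),
        {"module": module, "array_1": array_1, "array_2": array_2,
         "a": a, "b": b, "c": c},
    )
    return pstats.Stats(profiler)


# Hypothetical stand-in for a compiled module so the sketch runs on its own.
stub = types.ModuleType("compute_stub")
stub.compute = lambda array_1, array_2, a, b, c: np.clip(array_1, 2, 10) * a + array_2 * b + c
sys.modules["compute_stub"] = stub

array_1 = np.random.uniform(0, 1000, size=(30, 20)).astype(np.intc)
array_2 = np.random.uniform(0, 1000, size=(30, 20)).astype(np.intc)

stats = profile_compute("compute_stub", array_1, array_2, 4, 3, 9)
stats.strip_dirs().sort_stats("time").print_stats()
```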
    # Profile without type declarations
    6000005 function calls in 26.935 seconds

    Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1   13.955   13.955   26.934   26.934 compute_cy.pyx:7(compute)
      6000000   12.979    0.000   12.979    0.000 compute_cy.pyx:4(clip)
            1    0.001    0.001   26.935   26.935 <string>:1(<module>)
            1    0.000    0.000   26.935   26.935 {built-in method builtins.exec}
            1    0.000    0.000   26.934   26.934 {compute_cy.compute}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

    # Profile with type declarations
    4 function calls in 11.977 seconds

    Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1   11.976   11.976   11.976   11.976 {compute_typed.compute}
            1    0.001    0.001   11.977   11.977 <string>:1(<module>)
            1    0.000    0.000   11.977   11.977 {built-in method builtins.exec}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
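Comparison:
1: Total run time drops from 26.935 seconds to 11.977 seconds once the types are declared, roughly a 2.2x speedup.
2: The number of function calls recorded by the profiler falls from 6,000,005 to 4. clip is now a cdef function compiled to plain C, so calling it no longer pays Python's function-call overhead; such C-level calls also do not appear in cProfile output unless Cython's profiling directive is enabled.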
-
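4 Optimizing with memoryviews
Typed memoryviews: a C struct type provided by Cython for declaring arrays. It holds a pointer to the ndarray's data together with the buffer metadata (dimensions, strides, item size, item type information) and also supports slicing. In short, a variable declared with this type can be indexed and sliced much like an ndarray.
Declaring a memoryview:
    # Declare 1D, 2D, and 3D arrays of int.
    cdef int [:] foo         # 1D memoryview
    cdef int [:, :] foo      # 2D memoryview
    cdef int [:, :, :] foo   # 3D memoryview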
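Cython's typed memoryviews obtain their data through Python's buffer protocol, and Python's built-in `memoryview` exposes the same kind of metadata. The sketch below uses that built-in type, not Cython's, purely to show what the dimensions, strides, and item size of an `np.intc` array look like. It is an illustration in plain Python, not Cython code.

```python
import numpy as np

array = np.zeros((3, 4), dtype=np.intc)

# The built-in memoryview reads the array through the buffer protocol
# without copying the data.
view = memoryview(array)

print("ndim:    ", view.ndim)      # number of dimensions
print("shape:   ", view.shape)     # extent of each dimension
print("strides: ", view.strides)   # bytes to step along each dimension
print("itemsize:", view.itemsize)  # bytes per element
print("format:  ", view.format)    # struct-style code for the item type

# The view shares memory with the array, so a write to the array
# is visible when reading through the view.
array[1, 2] = 7
assert view[1, 2] == 7
```

A declaration such as `int[:, :]` tells Cython the item type and the number of dimensions up front, which is what lets it turn indexing into direct memory access.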
Create a new file compute_memview.pyx and, building on step 3, use memoryviews to optimize access to the ndarrays.
    # touch compute_memview.pyx
    import numpy as np

    DTYPE = np.intc


    cdef int clip(int a, int min_value, int max_value):
        return min(max(a, min_value), max_value)


    def compute(int[:, :] array_1, int[:, :] array_2, int a, int b, int c):

        cdef int x_max = array_1.shape[0]
        cdef int y_max = array_1.shape[1]

        # array_1.shape is now a C array, so it's not possible
        # to compare it simply by using == without a for-loop.
        # To be able to compare it to array_2.shape easily,
        # we convert them both to Python tuples.
        assert tuple(array_1.shape) == tuple(array_2.shape)

        result = np.zeros((x_max, y_max), dtype=DTYPE)
        cdef int[:, :] result_view = result

        cdef int tmp
        cdef int x, y

        for x in range(x_max):
            for y in range(y_max):
                tmp = clip(array_1[x, y], 2, 10)
                tmp = tmp * a + array_2[x, y] * b
                result_view[x, y] = tmp + c

        # Return the ndarray that result_view writes into.
        return result
Edit the profiling script cython_profile.py, replacing compute_typed with compute_memview, then profile again.
    # Profile with type declarations
    4 function calls in 11.977 seconds

    Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1   11.976   11.976   11.976   11.976 {compute_typed.compute}
            1    0.001    0.001   11.977   11.977 <string>:1(<module>)
            1    0.000    0.000   11.977   11.977 {built-in method builtins.exec}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

    # Profile after the memoryview optimization
    4 function calls in 0.017 seconds

    Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.017    0.017    0.017    0.017 {compute_memview.compute}
            1    0.000    0.000    0.017    0.017 {built-in method builtins.exec}
            1    0.000    0.000    0.017    0.017 <string>:1(<module>)
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
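Comparison: Declaring the ndarrays as typed memoryviews, a C-level data structure, cuts the run time from 11.977 seconds to 0.017 seconds, roughly a 700x speedup, because element access inside the loops no longer goes through Python objects.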
-
5 Multithreading (not yet tested successfully, so not covered for now)