Optimizing NumPy with Cython
Official tutorial: http://docs.cython.org/en/latest/src/userguide/numpy_tutorial.html#numpy-tutorial
Sample code: https://github.com/chenyangMl/cython-pro/tree/master/Numpy_cython
Cython supports all of NumPy's usual operations: Python code can be copied into a .pyx file as-is and Cython will compile it without complaint. But doing that forfeits the NumPy-specific speedups Cython can provide. The methods below show how Cython can currently be used to make NumPy code faster.
(Figure: memory layout of typed arrays)

Example:
-
1 The original compute_cy.pyx file
```cython
import numpy as np

def clip(a, min_value, max_value):
    return min(max(a, min_value), max_value)

def compute(array_1, array_2, a, b, c):
    """
    This function computes:
        np.clip(array_1, 2, 10) * a + array_2 * b + c
    array_1 and array_2 are 2D.
    """
    x_max = array_1.shape[0]
    y_max = array_1.shape[1]

    assert array_1.shape == array_2.shape

    result = np.zeros((x_max, y_max), dtype=array_1.dtype)

    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result[x, y] = tmp + c

    return result
```
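Before optimizing, it helps to pin down what compute is supposed to produce. A minimal pure-Python sketch (re-implementing the same loop, independent of the compiled module) checks the loop logic against the vectorized NumPy expression it implements:

```python
import numpy as np

def compute_py(array_1, array_2, a, b, c):
    # Same loop-based logic as compute_cy.pyx, in plain Python.
    x_max, y_max = array_1.shape
    result = np.zeros((x_max, y_max), dtype=array_1.dtype)
    for x in range(x_max):
        for y in range(y_max):
            tmp = min(max(array_1[x, y], 2), 10)  # clip to [2, 10]
            result[x, y] = tmp * a + array_2[x, y] * b + c
    return result

rng = np.random.default_rng(0)
a1 = rng.integers(0, 20, size=(4, 5)).astype(np.intc)
a2 = rng.integers(0, 20, size=(4, 5)).astype(np.intc)

# The loop version must agree with the one-line vectorized form.
expected = np.clip(a1, 2, 10) * 4 + a2 * 3 + 9
assert np.array_equal(compute_py(a1, a2, 4, 3, 9), expected)
```

The vectorized expression is also the natural baseline to benchmark the Cython versions against.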
-
2 Write the profiling script cython_profile.py
```python
import cProfile
import pstats
import numpy as np

array_1 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)
array_2 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)
a = 4
b = 3
c = 9

import compute_cy
cProfile.runctx("compute_cy.compute(array_1, array_2, a, b, c)",
                globals(),
                locals={"array_1": array_1, "array_2": array_2,
                        "a": a, "b": b, "c": c},
                filename="Profile.prof")

s = pstats.Stats("Profile.prof")
s.strip_dirs().sort_stats("time").print_stats()
```
```
6000005 function calls in 26.935 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   13.955   13.955   26.934   26.934 compute_cy.pyx:7(compute)
  6000000   12.979    0.000   12.979    0.000 compute_cy.pyx:4(clip)
        1    0.001    0.001   26.935   26.935 <string>:1(<module>)
        1    0.000    0.000   26.935   26.935 {built-in method builtins.exec}
        1    0.000    0.000   26.934   26.934 {compute_cy.compute}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
| Field | Meaning |
| --- | --- |
| ncalls | number of times the function was called |
| tottime | total time spent inside the function itself (excluding sub-calls) |
| percall | average time per call, tottime/ncalls |
| cumtime | cumulative time spent in the function and all of its sub-calls |
| filename:lineno(function) | file name, line number, and name of the profiled function |

The results show that of the 26.9 s total, roughly half (12.979 s) is spent in the 6,000,000 Python-level calls to clip, so eliminating that call overhead is the obvious first target.
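The cProfile/pstats workflow above works for any Python call, and the fields in the table can also be read programmatically. A self-contained sketch, profiling a throwaway function instead of compute (the function name `busy` and file name `demo.prof` are illustrative):

```python
import cProfile
import pstats

def busy(n):
    # Deliberately slow pure-Python loop so the profiler has something to measure.
    total = 0
    for i in range(n):
        total += i * i
    return total

cProfile.runctx("busy(100000)", globals(), {"busy": busy}, filename="demo.prof")

s = pstats.Stats("demo.prof")
s.strip_dirs().sort_stats("time")

# s.stats maps (filename, lineno, funcname) -> (cc, nc, tottime, cumtime, callers),
# where nc is the total call count and cc the primitive (non-recursive) count.
key = next(k for k in s.stats if k[2] == "busy")
cc, nc, tottime, cumtime, callers = s.stats[key]
assert nc == 1            # busy was called exactly once
assert cumtime >= tottime # cumtime includes time in sub-calls
```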
-
3 Optimizing by declaring data types
```cython
# touch compute_typed.pyx
import numpy as np

DTYPE = np.intc

# cdef means here that this function is a plain C function (so faster).
# To get all the benefits, we type the arguments and the return value.
cdef int clip(int a, int min_value, int max_value):
    return min(max(a, min_value), max_value)

def compute(array_1, array_2, int a, int b, int c):
    cdef int x_max = array_1.shape[0]
    cdef int y_max = array_1.shape[1]

    assert array_1.shape == array_2.shape
    assert array_1.dtype == DTYPE
    assert array_2.dtype == DTYPE

    result = np.zeros((x_max, y_max), dtype=DTYPE)

    cdef int tmp
    # Py_ssize_t is the proper C type for Python array indices.
    cdef Py_ssize_t x, y

    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result[x, y] = tmp + c

    return result
```
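The .pyx files have to be compiled before they can be imported. A minimal build-script sketch (the conventional file name is setup.py; this assumes Cython is installed, and you would list whichever .pyx files you actually have):

```python
# setup.py -- build in place with:  python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        ["compute_cy.pyx", "compute_typed.pyx"],
        language_level=3,  # compile the .pyx sources as Python 3
    ),
)
```

After building, `import compute_cy` and `import compute_typed` resolve to the compiled extension modules.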
Modify the profiling script cython_profile.py by replacing the following part, then run the profiling again:
```python
import compute_typed
cProfile.runctx("compute_typed.compute(array_1, array_2, a, b, c)",
                globals(),
                locals={"array_1": array_1, "array_2": array_2,
                        "a": a, "b": b, "c": c},
                filename="Profile.prof")
```
```
# Profiling results without type declarations
6000005 function calls in 26.935 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   13.955   13.955   26.934   26.934 compute_cy.pyx:7(compute)
  6000000   12.979    0.000   12.979    0.000 compute_cy.pyx:4(clip)
        1    0.001    0.001   26.935   26.935 <string>:1(<module>)
        1    0.000    0.000   26.935   26.935 {built-in method builtins.exec}
        1    0.000    0.000   26.934   26.934 {compute_cy.compute}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

# Profiling results with type declarations
4 function calls in 11.977 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   11.976   11.976   11.976   11.976 {compute_typed.compute}
        1    0.001    0.001   11.977   11.977 <string>:1(<module>)
        1    0.000    0.000   11.977   11.977 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
Comparison:
1. In total time, declaring data types yields a large speedup (26.935 s → 11.977 s).
2. The number of calls recorded by the profiler drops by orders of magnitude (6,000,005 → 4), because clip is now a plain C function whose calls no longer go through the Python interpreter.
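The collapse in recorded calls reflects the removal of one Python-level function call per element. A pure-Python micro-benchmark (illustrative only; absolute timings vary by machine) isolates that per-call cost by comparing a call-per-element loop with the same work inlined:

```python
import timeit

def clip(a, lo, hi):
    return min(max(a, lo), hi)

n = 100000

# One Python-level function call per element ...
with_calls = timeit.timeit(
    "for v in data: clip(v, 2, 10)",
    globals={"clip": clip, "data": range(n)}, number=5)

# ... versus the same work inlined into the loop body.
inlined = timeit.timeit(
    "for v in data: min(max(v, 2), 10)",
    globals={"data": range(n)}, number=5)

print(with_calls, inlined)
```

On CPython the call-per-element version is typically noticeably slower; the typed Cython version goes further still, because the C call to clip compiles down to a few machine instructions.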
-
4 Optimizing with memoryviews
Memoryviews are a C struct type that Cython provides for declaring arrays. A memoryview keeps the ndarray's metadata (dimensions, strides, item size, item type) behind a pointer, and supports slicing. In short, declaring this type gives you most of the familiar ndarray operations at C speed.
Declaring a memoryview:
```cython
# Declare 1D, 2D, and 3D arrays of C ints.
cdef int [:] foo          # 1D memoryview
cdef int [:, :] foo       # 2D memoryview
cdef int [:, :, :] foo    # 3D memoryview
```
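Cython memoryviews only exist inside .pyx files, but Python's built-in memoryview exposes the same buffer metadata (dimensions, strides, item size, item type) and can be used to inspect exactly what Cython receives from an ndarray:

```python
import numpy as np

arr = np.zeros((3, 4), dtype=np.intc)
m = memoryview(arr)  # wraps arr's buffer; no data is copied

print(m.ndim)      # number of dimensions: 2
print(m.shape)     # (3, 4)
print(m.strides)   # bytes to step per dimension, row-major here
print(m.itemsize)  # bytes per element (C int)
print(m.format)    # struct-style type code: 'i' for C int

# C-contiguous layout: stepping one row skips a whole row of items.
assert m.strides == (4 * m.itemsize, m.itemsize)
```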
Create a new file compute_memview.pyx that builds on step 3 by accessing the ndarrays through memoryviews.
```cython
# touch compute_memview.pyx
import numpy as np

DTYPE = np.intc

cdef int clip(int a, int min_value, int max_value):
    return min(max(a, min_value), max_value)

def compute(int[:, :] array_1, int[:, :] array_2, int a, int b, int c):
    cdef int x_max = array_1.shape[0]
    cdef int y_max = array_1.shape[1]

    # array_1.shape is now a C array, so it's not possible
    # to compare it simply by using == without a for-loop.
    # To be able to compare it to array_2.shape easily,
    # we convert them both to Python tuples.
    assert tuple(array_1.shape) == tuple(array_2.shape)

    result = np.zeros((x_max, y_max), dtype=DTYPE)
    cdef int[:, :] result_view = result

    cdef int tmp
    cdef int x, y

    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result
```
Modify the profiling script cython_profile.py again, replacing compute_typed with compute_memview, then run the profiling once more.
```
# Profiling results with type declarations
4 function calls in 11.977 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   11.976   11.976   11.976   11.976 {compute_typed.compute}
        1    0.001    0.001   11.977   11.977 <string>:1(<module>)
        1    0.000    0.000   11.977   11.977 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

# Profiling results with memoryviews
4 function calls in 0.017 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.017    0.017    0.017    0.017 {compute_memview.compute}
        1    0.000    0.000    0.017    0.017 {built-in method builtins.exec}
        1    0.000    0.000    0.017    0.017 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
Comparison: accessing the ndarrays through C-level memoryviews gives another qualitative leap in execution speed (11.977 s → 0.017 s).
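The reason the writes to result_view end up in result is that a memoryview shares the ndarray's buffer instead of copying it. The same aliasing can be demonstrated in plain Python with the built-in memoryview:

```python
import numpy as np

result = np.zeros(5, dtype=np.intc)
view = memoryview(result)   # shares result's buffer; no copy is made

view[0] = 42                # write through the view ...
assert result[0] == 42      # ... and the ndarray sees it

result[1] = 7               # write to the ndarray ...
assert view[1] == 7         # ... and the view sees it
```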
-
5 Multithreading (not yet working in my tests; to be documented later)