Cython: Optimizing NumPy with Memoryviews (Part 3)

NumPy optimization methods in Cython

Official reference: http://docs.cython.org/en/latest/src/userguide/numpy_tutorial.html#numpy-tutorial
Code on GitHub: https://github.com/chenyangMl/cython-pro/tree/master/Numpy_cython

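Cython supports all ordinary NumPy operations: Python code copied into a .pyx file compiles without changes. Doing only that, however, forgoes the NumPy-specific optimizations that Cython offers. Cython currently lets you speed up NumPy code with the following methods.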

Typed memoryviews

Example:

  • 1 The original compute_cy.pyx file

    import numpy as np
    
    def clip(a, min_value, max_value):
        return min(max(a, min_value), max_value)
    
    def compute(array_1, array_2, a, b, c):
        """
        This function computes:
        np.clip(array_1, 2, 10) * a + array_2 * b + c
    
        array_1 and array_2 are 2D.
        """
        x_max = array_1.shape[0]
        y_max = array_1.shape[1]
    
        assert array_1.shape == array_2.shape
    
        result = np.zeros((x_max, y_max), dtype=array_1.dtype)
    
        for x in range(x_max):
            for y in range(y_max):
                tmp = clip(array_1[x, y], 2, 10)
                tmp = tmp * a + array_2[x, y] * b
                result[x, y] = tmp + c
    
        return result
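Since the docstring states that the loop is meant to reproduce `np.clip(array_1, 2, 10) * a + array_2 * b + c`, a quick check against that vectorized expression is a useful safety net before optimizing. The sketch below is plain Python, not part of the original repository; the names `compute_loop` and `compute_numpy` and the small 30x20 test arrays are assumptions chosen for illustration.

```python
import numpy as np


def compute_loop(array_1, array_2, a, b, c):
    """Element-by-element loop with the same logic as compute() above."""
    x_max, y_max = array_1.shape
    assert array_1.shape == array_2.shape
    result = np.zeros((x_max, y_max), dtype=array_1.dtype)
    for x in range(x_max):
        for y in range(y_max):
            tmp = min(max(array_1[x, y], 2), 10)  # clip to [2, 10]
            tmp = tmp * a + array_2[x, y] * b
            result[x, y] = tmp + c
    return result


def compute_numpy(array_1, array_2, a, b, c):
    """Vectorized reference taken from the docstring of compute()."""
    return np.clip(array_1, 2, 10) * a + array_2 * b + c


# Small arrays keep the pure-Python loop fast enough for a sanity check.
rng = np.random.default_rng(0)
array_1 = rng.integers(0, 1000, size=(30, 20)).astype(np.intc)
array_2 = rng.integers(0, 1000, size=(30, 20)).astype(np.intc)

expected = compute_numpy(array_1, array_2, 4, 3, 9)
actual = compute_loop(array_1, array_2, 4, 3, 9)
print("loop matches vectorized reference:", np.array_equal(expected, actual))
```

The same comparison can be run against each compiled variant to confirm that an optimization did not change the results.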
    
  • 2 Write the profiling script cython_profile.py

    import cProfile
    import pstats
    
    import numpy as np
    
    import compute_cy  # the compiled Cython extension module
    
    array_1 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)
    array_2 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)
    a = 4
    b = 3
    c = 9
    
    # Profile one call to compute() and save the raw statistics to Profile.prof.
    cProfile.runctx("compute_cy.compute(array_1,array_2,a,b,c)",
                    globals(),
                    locals={"array_1":array_1,
                            "array_2":array_2,
                            "a":a,"b":b,"c":c},
                    filename="Profile.prof")
    
    # Load the saved statistics and print them sorted by internal time.
    s = pstats.Stats("Profile.prof")
    s.strip_dirs().sort_stats("time").print_stats()
    
             6000005 function calls in 26.935 seconds
       Ordered by: internal time
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1   13.955   13.955   26.934   26.934 compute_cy.pyx:7(compute)
      6000000   12.979    0.000   12.979    0.000 compute_cy.pyx:4(clip)
            1    0.001    0.001   26.935   26.935 <string>:1(<module>)
            1    0.000    0.000   26.935   26.935 {built-in method builtins.exec}
            1    0.000    0.000   26.934   26.934 {compute_cy.compute}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    
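    Column                     Meaning
    ncalls                     number of times the function was called
    tottime                    total time spent inside the function itself, excluding the sub-functions it calls
    percall                    average time per call; the first percall column is tottime/ncalls, the second is cumtime/ncalls
    cumtime                    cumulative time spent in the function and in all the sub-functions it calls
    filename:lineno(function)  file name, line number and name of the profiled function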

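    Result: the untyped version makes 6,000,005 function calls and takes 26.935 seconds, split almost evenly between compute itself (13.955 s) and the 6,000,000 calls to clip (12.979 s).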
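To make the columns concrete, the following sketch profiles a small pure-Python loop and reads the same fields programmatically. It is an illustration only, not the author's script: the `run` function and the loop size of 1000 are assumptions, and `get_stats_profile()` requires Python 3.9 or later.

```python
import cProfile
import pstats


def clip(a, min_value, max_value):
    return min(max(a, min_value), max_value)


def run(n):
    """Call clip() n times, so that ncalls for clip equals n."""
    total = 0
    for i in range(n):
        total += clip(i, 2, 10)
    return total


profiler = cProfile.Profile()
profiler.enable()
total = run(1000)
profiler.disable()

# get_stats_profile() exposes the same columns that print_stats() prints.
stats = pstats.Stats(profiler)
profile = stats.get_stats_profile()
clip_row = profile.func_profiles["clip"]
run_row = profile.func_profiles["run"]

print("clip: ncalls =", clip_row.ncalls, "tottime =", clip_row.tottime)
print("run:  ncalls =", run_row.ncalls, "cumtime =", run_row.cumtime)
```

Here `run` plays the role of compute: its cumtime includes the time spent in every call to `clip`, while its tottime covers only the loop itself.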

  • 3 Optimize by declaring data types

    # compute_typed.pyx
    import numpy as np
    
    DTYPE = np.intc
    
    # cdef means here that this function is a plain C function (so faster).
    # To get all the benefits, we type the arguments and the return value.
    cdef int clip(int a, int min_value, int max_value):
        return min(max(a, min_value), max_value)
    
    def compute(array_1, array_2, int a, int b, int c):
    
        cdef int x_max = array_1.shape[0]
        cdef int y_max = array_1.shape[1]
    
        assert array_1.shape == array_2.shape
        assert array_1.dtype == DTYPE
        assert array_2.dtype == DTYPE
    
        result = np.zeros((x_max, y_max), dtype=DTYPE)
    
        cdef int tmp
    
        # Py_ssize_t is the proper C type for Python array indices;
        # int is used here and is large enough for these array sizes.
        cdef int x, y
    
        for x in range(x_max):
            for y in range(y_max):
    
                tmp = clip(array_1[x, y], 2, 10)
                tmp = tmp * a + array_2[x, y] * b
                result[x, y] = tmp + c
    
        return result
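The two dtype assertions in the listing above matter because `np.intc` is NumPy's name for the C `int` type that the `cdef int` declarations use. The sketch below is plain Python added for illustration (the array names are assumptions): it checks that the two sizes agree on the current platform and shows how an array with NumPy's default integer type can be converted before it is passed to `compute`.

```python
import ctypes

import numpy as np

# np.intc corresponds to the C compiler's int, so the sizes should agree.
c_int_size = ctypes.sizeof(ctypes.c_int)
intc_size = np.dtype(np.intc).itemsize

# np.arange uses the platform's default integer type, which is often wider
# than C int, so it is converted explicitly (astype makes a copy).
default_ints = np.arange(6).reshape(2, 3)
as_intc = default_ints.astype(np.intc)

print("sizeof(int):", c_int_size, "np.intc itemsize:", intc_size)
print("before:", default_ints.dtype, "after:", as_intc.dtype)
```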
    
    

    In the profiling script cython_profile.py, replace the following part, then profile again:

    import compute_typed
    cProfile.runctx("compute_typed.compute(array_1,array_2,a,b,c)",
                    globals(),
                    locals={"array_1":array_1,
                            "array_2":array_2,
                            "a":a,"b":b,"c":c},
                            filename="Profile.prof")
    
    # Profile without type declarations
            6000005 function calls in 26.935 seconds
       Ordered by: internal time
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1   13.955   13.955   26.934   26.934 compute_cy.pyx:7(compute)
      6000000   12.979    0.000   12.979    0.000 compute_cy.pyx:4(clip)
            1    0.001    0.001   26.935   26.935 <string>:1(<module>)
            1    0.000    0.000   26.935   26.935 {built-in method builtins.exec}
            1    0.000    0.000   26.934   26.934 {compute_cy.compute}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
            
    # Profile with type declarations
             4 function calls in 11.977 seconds
       Ordered by: internal time
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1   11.976   11.976   11.976   11.976 {compute_typed.compute}
            1    0.001    0.001   11.977   11.977 <string>:1(<module>)
            1    0.000    0.000   11.977   11.977 {built-in method builtins.exec}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    

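    Comparison: 1: Total run time drops from 26.935 s to 11.977 s once the types are declared, roughly a 2.2x speedup.

    2: The number of recorded function calls drops from 6,000,005 to 4. clip is now a cdef C function, so its 6,000,000 calls no longer show up as Python-level calls in the profile.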

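  • 4 Optimize with memoryviews

    Memoryviews are a C struct type that Cython provides for declaring arrays. A memoryview holds a pointer to the ndarray's data together with its metadata (dimensions, strides, item size, item type information) and supports slicing. In short, declaring an array as a memoryview still allows most ndarray-style element access and slicing, but at C level.

    Declaring memoryviews:

    # Declare 1D, 2D and 3D memoryviews of int.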
    cdef int [:] foo         # 1D memoryview
    cdef int [:, :] foo      # 2D memoryview
    cdef int [:, :, :] foo   # 3D memoryview
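The metadata listed above can be inspected from plain Python as well. The sketch below uses Python's built-in `memoryview`, which is a different object from a Cython typed memoryview but is built on the same buffer protocol; it is meant only to show what dimensions, strides and item size look like for a small `np.intc` array (the array contents are an assumption for illustration).

```python
import numpy as np

arr = np.arange(12, dtype=np.intc).reshape(3, 4)

# Python's built-in memoryview (not a Cython typed memoryview) reads the
# same buffer metadata that NumPy exports.
view = memoryview(arr)
print("ndim:", view.ndim)
print("shape:", view.shape)
print("strides (bytes):", view.strides)
print("itemsize (bytes):", view.itemsize)
print("format:", view.format)

# The view shares memory with the array: no data is copied.
arr[1, 2] = 42
shared_value = view[1, 2]

# Slicing the array yields another view of the same data, with new strides.
column_slice = arr[:, ::2]
slice_view = memoryview(column_slice)
print("sliced shape:", slice_view.shape, "sliced strides:", slice_view.strides)
```

Because a view only carries a pointer plus this metadata, creating one or slicing it does not copy the array data.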
    

    Create a new file, compute_memview.pyx, and build on step 3 by using memoryviews to speed up access to the ndarrays.

    # compute_memview.pyx
    import numpy as np
    
    DTYPE = np.intc
    
    cdef int clip(int a, int min_value, int max_value):
        return min(max(a, min_value), max_value)
    
    def compute(int[:, :] array_1, int[:, :] array_2, int a, int b, int c):
    
        cdef int x_max = array_1.shape[0]
        cdef int y_max = array_1.shape[1]
    
        # array_1.shape is now a C array, so it's not possible
        # to compare it simply by using == without a for-loop.
        # To be able to compare it to array_2.shape easily,
        # we convert them both to Python tuples.
        assert tuple(array_1.shape) == tuple(array_2.shape)
    
        result = np.zeros((x_max, y_max), dtype=DTYPE)
        cdef int[:, :] result_view = result
    
        cdef int tmp
        cdef int x, y
    
        for x in range(x_max):
            for y in range(y_max):
    
                tmp = clip(array_1[x, y], 2, 10)
                tmp = tmp * a + array_2[x, y] * b
                result_view[x, y] = tmp + c
    
        return result
    

    In the profiling script cython_profile.py, replace compute_typed with compute_memview, then profile again.

    # Profile with type declarations
             4 function calls in 11.977 seconds
       Ordered by: internal time
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1   11.976   11.976   11.976   11.976 {compute_typed.compute}
            1    0.001    0.001   11.977   11.977 <string>:1(<module>)
            1    0.000    0.000   11.977   11.977 {built-in method builtins.exec}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    # Profile after the memoryview optimization
             4 function calls in 0.017 seconds
       Ordered by: internal time
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.017    0.017    0.017    0.017 {compute_memview.compute}
            1    0.000    0.000    0.017    0.017 {built-in method builtins.exec}
            1    0.000    0.000    0.017    0.017 <string>:1(<module>)
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}        
    

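    Comparison: accessing the ndarrays through typed memoryviews, which are C-level data structures, cuts the run time from 11.977 s to 0.017 s, roughly 700 times faster.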
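The speedup factors can be worked out directly from the three profiler totals reported above. The sketch below only does that arithmetic; the dictionary labels are assumptions for readability, and because the profiler prints three decimal places, the 0.017 s figure (and any ratio based on it) is approximate.

```python
# Total run times in seconds, copied from the profiler output above.
timings = {
    "untyped (compute_cy)": 26.935,
    "typed (compute_typed)": 11.977,
    "memoryview (compute_memview)": 0.017,
}

baseline = timings["untyped (compute_cy)"]
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x relative to the untyped version")

# Gain from the memoryview step alone, relative to the typed version.
memview_vs_typed = (timings["typed (compute_typed)"]
                    / timings["memoryview (compute_memview)"])
print(f"memoryview vs typed: {memview_vs_typed:.0f}x")
```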

  • 5 Multithreading (not covered here: I have not gotten it to work in testing)
