Python Notes: A First Step into Functional Programming


There are plenty of explanations of functional programming online. Here is a passage I found on Wikipedia:

functional programming is a programming paradigm—a style of building the structure and elements of computer programs—that treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data.

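Put simply, functional programming is a programming style, just like object-oriented and procedural programming. It reflects the idea of a mapping and is somewhat akin to modelling: data flows from input to output, the input is never modified, and a result is returned at the end. It relies heavily on immutable data structures, which reduces the room for errors and makes programs easier to maintain.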

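I once came across a good thread on Zhihu where many people share their views. If you are interested, look up: 什麼是函數式編程思維? ("What is functional programming thinking?")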

The rest of this post walks through functional programming with some simple example code:

Immutable data structures

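Functional programming works on immutable data. When building data structures, you can use the namedtuple() function from Python's built-in collections module to create immutable ones.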

import collections

Scientist = collections.namedtuple('Scientist', [
    'name',
    'field',
    'born',
    'nobel'
])

__scientists = (
    Scientist(name = 'Ada Lovelace', field = 'math', born = 1815, nobel = False),
    Scientist(name = 'Emmy Noether', field = 'math', born = 1882, nobel = False),
    Scientist(name = 'Marie Curie', field = 'math', born = 1867, nobel = True),
    Scientist(name = 'Tu Youyou', field = 'physics', born = 1930, nobel = True),
    Scientist(name = 'Ada Yonath', field = 'chemistry', born = 1939, nobel = True),
    Scientist(name = 'Vera Rubin', field = 'astronomy', born = 1928, nobel = False),
    Scientist(name = 'Sally Ride', field = 'physics', born = 1951, nobel = False)
)

def getData():
    return __scientists

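The code above defines an immutable data structure, creates instances of it, and provides getData() for other code to call. Such data structures matter in parallel computing: because the data cannot be modified, it can be shared safely without locks to guard against concurrent changes.
Let's look more closely at how it is built.
First, namedtuple creates a subclass of tuple (the name translates literally as "named tuple"). The subclass is called Scientist and has the fields 'name', 'field', 'born' and 'nobel'.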

Scientist = collections.namedtuple('Scientist', [
    'name',
    'field',
    'born',
    'nobel'
])
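Because the generated class is a tuple subclass, its instances also behave like ordinary tuples. The following is a small sketch that re-creates the same Scientist class so it runs on its own; the variable ada is only there for illustration:

```python
import collections

Scientist = collections.namedtuple('Scientist', ['name', 'field', 'born', 'nobel'])
ada = Scientist(name='Ada Lovelace', field='math', born=1815, nobel=False)

# The generated class really is a subclass of tuple.
print(issubclass(Scientist, tuple))   # True
# The field names are stored, in order, on the class.
print(Scientist._fields)              # ('name', 'field', 'born', 'nobel')
# Fields can be read by name or by position.
print(ada.name, ada[0])               # Ada Lovelace Ada Lovelace
# Instances unpack like any other tuple.
name, field, born, nobel = ada
print(born)                           # 1815
```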

Next, the tuple __scientists is created; each of its elements is an instance of Scientist.

__scientists = (
    Scientist(name = 'Ada Lovelace', field = 'math', born = 1815, nobel = False),
    Scientist(name = 'Emmy Noether', field = 'math', born = 1882, nobel = False),
    Scientist(name = 'Marie Curie', field = 'math', born = 1867, nobel = True),
    Scientist(name = 'Tu Youyou', field = 'physics', born = 1930, nobel = True),
    Scientist(name = 'Ada Yonath', field = 'chemistry', born = 1939, nobel = True),
    Scientist(name = 'Vera Rubin', field = 'astronomy', born = 1928, nobel = False),
    Scientist(name = 'Sally Ride', field = 'physics', born = 1951, nobel = False)
)

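The purpose, as mentioned repeatedly, is to obtain an immutable data structure. For comparison, we can run a few simple tests in IDLE, which ships with Python:
In these tests I create the immutable Scientist type as above and a dictionary, Scientists_dict_ada, and then try to change an element of each.
First, the Scientist class: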

# Create the Scientist class
>>> from collections import namedtuple
>>> Scientist = namedtuple('Scientist',[
	'name',
	'field',
	'born',
	'nobel'
	])
>>> Scientist
<class '__main__.Scientist'>
# Create an instance
>>> ada =  Scientist(
		'Ada Lovelace',
		'math',1815,False
		)
# Read individual fields of the instance
>>> ada.name
'Ada Lovelace'
>>> ada.field
'math'
# Changing a field raises an error: the data cannot be modified
>>> ada.name = 'Ed Lovelace'
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    ada.name = 'Ed Lovelace'
AttributeError: can't set attribute

Now the dictionary, Scientists_dict_ada:

>>> Scientists_dict_ada={
		'name':'Ada Lovelace',
		'field':'math',
		'born':1815,
		'nobel':False
		}
# Read a specific field
>>> Scientists_dict_ada['name']
'Ada Lovelace'
# Change a specific field
>>> Scientists_dict_ada['name']='Ed Lovelace'
>>> Scientists_dict_ada['name']
'Ed Lovelace'		

As you can see, data held in a dictionary is easily changed.

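The test above dealt with a single record. With several records we would normally put them in a tuple, since a tuple is itself immutable. But if the elements of the tuple are of a mutable type, the data inside an element can still be changed. To check this, we create two more sets of records and put each into a tuple: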

# Create a tuple of Scientist instances
>>> __Scientists = (
	Scientist('Ada Lovelace','math',1815,False),
	Scientist('Ada1 Lovelace','math',1815,True)
	)
>>> __Scientists
(Scientist(name='Ada Lovelace', field='math', born=1815, nobel=False), Scientist(name='Ada1 Lovelace', field='math', born=1815, nobel=True))
# Create a tuple of dictionaries
>>> Scientists_dict = (
	{'name':'Ada Lovelace','field':'math','born':1815,'nobel':False},
	{'nime':'Ada1 Lovelace','field':'math','born':1815,'nobel':True}
	)
>>> Scientists_dict
({'name': 'Ada Lovelace', 'field': 'math', 'born': 1815, 'nobel': False}, {'nime': 'Ada1 Lovelace', 'field': 'math', 'born': 1815, 'nobel': True})

Note that in the second dictionary I misspelled the field 'name' as 'nime'. The program did not notice and stored the record as usual. This cannot happen with Scientist instances.
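To see the misspelling being rejected, here is a minimal sketch that tries to build a Scientist with the keyword nime. The Scientist class is re-created so the snippet runs on its own, and the error variable is only there to hold the exception for inspection:

```python
import collections

Scientist = collections.namedtuple('Scientist', ['name', 'field', 'born', 'nobel'])

error = None
try:
    # 'nime' is not a declared field, so construction fails immediately.
    Scientist(nime='Ada1 Lovelace', field='math', born=1815, nobel=True)
except TypeError as exc:
    error = exc
    print(f'rejected: {exc}')
```

The mistake surfaces at the point where the record is created, rather than later when some other code looks up the missing 'name' key.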
Next, try to modify the data inside the tuples:

# Try to modify an element of __Scientists
>>> __Scientists[0]
Scientist(name='Ada Lovelace', field='math', born=1815, nobel=False)
>>> __Scientists[0]['name']='ada'
Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    __Scientists[0]['name']='ada'
TypeError: 'Scientist' object does not support item assignment
# Modify an element of Scientists_dict
>>> Scientists_dict[0]
{'name': 'Ada Lovelace', 'field': 'math', 'born': 1815, 'nobel': False}
>>> Scientists_dict[0]['name']
'Ada Lovelace'
>>> Scientists_dict[0]['name']='ada'
>>> Scientists_dict[0]['name']
'ada'

As shown, __Scientists cannot be modified because its elements are Scientist instances, whereas the elements of Scientists_dict can be changed.

One more question: what if we put the data into a mutable container instead of a tuple?
Again a simple example:

# Create a list of Scientist elements
>>> __Scientists_with_list = [
	Scientist('Ada Lovelace','math',1815,False),
	Scientist('Ada1 Lovelace','math',1815,True)
	]
>>> __Scientists_with_list
[Scientist(name='Ada Lovelace', field='math', born=1815, nobel=False), Scientist(name='Ada1 Lovelace', field='math', born=1815, nobel=True)]
# Try to change a field of a list element
>>> __Scientists_with_list[0]['name']='ada'
Traceback (most recent call last):
  File "<pyshell#47>", line 1, in <module>
    __Scientists_with_list[0]['name']='ada'
TypeError: 'Scientist' object does not support item assignment
# Delete an element from the list
>>> del __Scientists_with_list[0]
>>> __Scientists_with_list
[Scientist(name='Ada1 Lovelace', field='math', born=1815, nobel=True)]

This test shows that the fields of the elements still cannot be changed, thanks to Scientist, but because the container is a list, elements can be deleted from it.

This series of comparisons shows the advantage of an immutable data structure built from a tuple plus namedtuple() when it comes to protecting data.
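When a changed record is needed, the usual approach with a namedtuple is to build a new instance rather than edit the old one. The sketch below uses the documented _replace method. It also shows a caveat: the immutability is shallow, so a mutable object stored in a field can still be changed. The Group class and its fields are hypothetical, introduced only to illustrate that caveat.

```python
import collections

Scientist = collections.namedtuple('Scientist', ['name', 'field', 'born', 'nobel'])
ada = Scientist('Ada Lovelace', 'math', 1815, False)

# _replace builds a new instance; the original is left untouched.
ed = ada._replace(name='Ed Lovelace')
print(ada.name)    # Ada Lovelace
print(ed.name)     # Ed Lovelace
print(ed is ada)   # False

# Caveat: a field that holds a mutable object can still be mutated in place.
Group = collections.namedtuple('Group', ['label', 'members'])  # hypothetical type
group = Group('math', ['Ada Lovelace'])
group.members.append('Emmy Noether')
print(group.members)   # ['Ada Lovelace', 'Emmy Noether']
```

Storing tuples instead of lists in such fields keeps the whole structure immutable.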

Back to our topic: immutable data structures of this kind are an important part of functional programming.

Three basic functions

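The three basic functions of functional programming are introduced below, using the immutable data created in the previous section for the tests.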

filter()

filter(function or None, iterable) --> filter object

filter() takes two arguments: a function and an iterable. The function is often defined inline as a lambda expression.
First, a filter that selects the Nobel laureates, i.e. nobel = True:

# Import pprint for more readable output of data structures
>>> from pprint import pprint
# Print the current __scientists data for reference
>>> pprint(tuple(__scientists))
(Scientist(name='Ada Lovelace', field='math', born=1815, nobel=False),
 Scientist(name='Emmy Noether', field='math', born=1882, nobel=False),
 Scientist(name='Marie Curie', field='math', born=1867, nobel=True),
 Scientist(name='Tu Youyou', field='physics', born=1930, nobel=True),
 Scientist(name='Ada Yonath', field='chemistry', born=1939, nobel=True),
 Scientist(name='Vera Rubin', field='astronomy', born=1928, nobel=False),
 Scientist(name='Sally Ride', field='physics', born=1951, nobel=False))
# Define the filtering function
>>> def is_nobel(X):
		return X.nobel
# Pass the function to filter as an argument and bind the result to fs
>>> fs = filter(is_nobel, __scientists)
>>> pprint(tuple(fs))
(Scientist(name='Marie Curie', field='math', born=1867, nobel=True),
 Scientist(name='Tu Youyou', field='physics', born=1930, nobel=True),
 Scientist(name='Ada Yonath', field='chemistry', born=1939, nobel=True))

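As the code shows, is_nobel is passed to filter by name only, without calling it. filter then walks through the elements of __scientists and keeps those for which the function returns True; tuple(fs) collects them.
Defining a separate named function for such a simple test is cumbersome; a lambda expression solves this neatly.
The same filter written with a lambda expression: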

>>> fs = tuple(filter(lambda x: x.nobel is True, __scientists))
>>> pprint(fs)
(Scientist(name='Marie Curie', field='math', born=1867, nobel=True),
 Scientist(name='Tu Youyou', field='physics', born=1930, nobel=True),
 Scientist(name='Ada Yonath', field='chemistry', born=1939, nobel=True))

A few more exercises to get familiar with filter():

  1. When the condition is always True, all elements are returned:
>>> fs = tuple(filter(lambda x: True, __scientists))
>>> pprint(fs)
(Scientist(name='Ada Lovelace', field='math', born=1815, nobel=False),
 Scientist(name='Emmy Noether', field='math', born=1882, nobel=False),
 Scientist(name='Marie Curie', field='math', born=1867, nobel=True),
 Scientist(name='Tu Youyou', field='physics', born=1930, nobel=True),
 Scientist(name='Ada Yonath', field='chemistry', born=1939, nobel=True),
 Scientist(name='Vera Rubin', field='astronomy', born=1928, nobel=False),
 Scientist(name='Sally Ride', field='physics', born=1951, nobel=False))
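  2. When the first argument is None, filter keeps the elements that are themselves truthy. Every Scientist instance here is a non-empty tuple and therefore truthy, so all elements are returned: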
>>> fs = tuple(filter(None,__scientists))
>>> pprint(fs)
(Scientist(name='Ada Lovelace', field='math', born=1815, nobel=False),
 Scientist(name='Emmy Noether', field='math', born=1882, nobel=False),
 Scientist(name='Marie Curie', field='math', born=1867, nobel=True),
 Scientist(name='Tu Youyou', field='physics', born=1930, nobel=True),
 Scientist(name='Ada Yonath', field='chemistry', born=1939, nobel=True),
 Scientist(name='Vera Rubin', field='astronomy', born=1928, nobel=False),
 Scientist(name='Sally Ride', field='physics', born=1951, nobel=False))
  3. Filtering on several conditions at once:
>>> fs = tuple(filter(lambda x: x.field == 'physics' and x.nobel == True, __scientists))
>>> pprint(fs)
(Scientist(name='Tu Youyou', field='physics', born=1930, nobel=True),)

The code above keeps the elements whose field is 'physics' and whose nobel is True.
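Two details of filter() are easy to miss in the exercises above: passing None keeps only truthy items, and the object filter returns is an iterator that can be consumed only once. A minimal sketch with plain values chosen for illustration:

```python
values = [0, 1, '', 'a', None, [], [0]]

# With None as the function, filter keeps the truthy items only.
truthy = list(filter(None, values))
print(truthy)        # [1, 'a', [0]]

# A filter object is a one-shot iterator: once consumed, it is empty.
evens = filter(lambda n: n % 2 == 0, range(6))
first_pass = list(evens)
second_pass = list(evens)
print(first_pass)    # [0, 2, 4]
print(second_pass)   # []
```

This is why the examples wrap the result in tuple() before printing or reusing it.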

Some further thoughts:

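The code so far gives a reasonable picture of filter(): it picks the elements that satisfy a condition out of a given collection. For this kind of task, don't forget that Python offers another tool, the list comprehension: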

>>> fs = tuple([x for x in __scientists if x.nobel is True])
>>> pprint(fs)
(Scientist(name='Marie Curie', field='math', born=1867, nobel=True),
 Scientist(name='Tu Youyou', field='physics', born=1930, nobel=True),
 Scientist(name='Ada Yonath', field='chemistry', born=1939, nobel=True))

As you can see, a list comprehension filters just as conveniently. Since the focus here is filter(), I only mention it and won't go further.

map()

map(func, *iterables) --> map object

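map() applies a mapping to the given iterable. Put plainly, as I see it, its main use is to take an existing immutable data structure and produce new data of a new structure by transforming each element with a function.
map() takes a function and one or more iterables. Using the same data as before, here is map() in use:
We use map() to build new data containing only each scientist's name and age (2020 minus the year of birth):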

>>> res = tuple(map(
	lambda x: {
	'name':x.name,
	'age':2020-x.born},
	__scientists))
>>> pprint(res)
({'age': 205, 'name': 'Ada Lovelace'},
 {'age': 138, 'name': 'Emmy Noether'},
 {'age': 153, 'name': 'Marie Curie'},
 {'age': 90, 'name': 'Tu Youyou'},
 {'age': 81, 'name': 'Ada Yonath'},
 {'age': 92, 'name': 'Vera Rubin'},
 {'age': 69, 'name': 'Sally Ride'})

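The code above passes a lambda expression as the first argument. Each element of __scientists is passed to the lambda in turn, and the result becomes the corresponding element of the new data. As said before, this builds a new data set with a new structure from existing immutable data, and the lambda determines what it contains. Written mathematically, it is a mapping:
new data = f(old data)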
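The signature map(func, *iterables) also allows several iterables at once. In that case the function receives one item from each iterable per call. A small sketch with illustrative data (the names and born tuples are made up for this snippet):

```python
names = ('Ada Lovelace', 'Emmy Noether', 'Marie Curie')
born = (1815, 1882, 1867)

# With two iterables, the function is called with one item from each.
labels = tuple(map(lambda n, b: f'{n} ({b})', names, born))
print(labels)
# ('Ada Lovelace (1815)', 'Emmy Noether (1882)', 'Marie Curie (1867)')

# map stops as soon as the shortest iterable is exhausted.
short = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20]))
print(short)   # [11, 22]
```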

Some further thoughts:

Like filter(), map() can be replaced by a list comprehension, or by a generator expression. The generator-expression version of the code above:

>>> res = tuple({'name':x.name,'age':2020-x.born} for x in __scientists)
>>> pprint(res)
({'age': 205, 'name': 'Ada Lovelace'},
 {'age': 138, 'name': 'Emmy Noether'},
 {'age': 153, 'name': 'Marie Curie'},
 {'age': 90, 'name': 'Tu Youyou'},
 {'age': 81, 'name': 'Ada Yonath'},
 {'age': 92, 'name': 'Vera Rubin'},
 {'age': 69, 'name': 'Sally Ride'})

In most cases the generator expression is closer to Python's way of doing things, what is usually called Pythonic.

And some more thoughts:

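If list comprehensions or generator expressions can already do the job, why still use map() sometimes? My understanding: comprehensions and generator expressions are more direct and Pythonic, but when the function already exists (rather than being written as a lambda), map() can be more efficient. A simple example:

>>> import timeit
>>> timeit.timeit('map(hex,xs)','xs=range(10)')
0.33299910000005184
>>> timeit.timeit('(hex(i) for i in xs)','xs=range(10)')
0.5141750999996475

In this measurement map comes out ahead. Note that neither statement consumes its result, so the numbers reflect only the cost of creating the lazy map object versus the generator object, not of calling hex on each element. The difference can matter when processing very large data sets, but it is worth measuring on the real workload.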
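For a comparison that includes the actual calls, both forms have to be consumed, for example by wrapping them in list(). The following is a sketch of such a benchmark, not a definitive measurement: the input size and repeat count are arbitrary choices, and the timings depend on the machine and Python version, so no expected numbers are shown.

```python
import timeit

xs = range(1000)      # arbitrary input size for the sketch
env = {'xs': xs}

# Each statement builds the full list, so the hex() calls are actually timed.
t_map = timeit.timeit('list(map(hex, xs))', globals=env, number=1000)
t_gen = timeit.timeit('list(hex(i) for i in xs)', globals=env, number=1000)
t_comp = timeit.timeit('[hex(i) for i in xs]', globals=env, number=1000)

print(f'map:                  {t_map:.4f} s')
print(f'generator expression: {t_gen:.4f} s')
print(f'list comprehension:   {t_comp:.4f} s')
```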

reduce()

reduce(function, sequence[, initial]) -> value

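reduce() takes up to three arguments: the function, the sequence, and an optional initial value. Note that function must itself take two arguments: the first acts as an accumulator, the second receives an element of the sequence. In other words, the elements of sequence are passed one by one as the second argument of function; each result becomes the first argument of the next call, and the final result is what reduce returns. A simple example: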

>>> from functools import reduce
>>> res = reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])
>>> res
15

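In this example reduce performs a running sum. Since no initial value is given, reduce uses the first element of the list, 1, as the starting value of x, and y then takes the values 2, 3, 4, 5 in turn. The result is 15.
If x is given the initial value 3, y takes all five values 1 through 5: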

>>> res = reduce(lambda x, y: x+y, [1, 2, 3, 4, 5],3)
>>> res
18

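Here is another example of reduce without an initial value, this time with strings as the list elements. Again the first element, '1', seeds the accumulator: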

>>> res = reduce(lambda x,y:x+y,['1','2','3','4','5'])
>>> res
'12345'

With the initial value 'res':

>>> res = reduce(lambda x,y:x+y,['1','2','3','4','5'],'res')
>>> res
'res12345'

These examples make the order of evaluation easier to see.

One more reminder:

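Suppose we name the three arguments of reduce() function, sequence and initial (optional),
and the two parameters of function acc and item.
Then sequence supplies the values for item, while acc starts from initial if it is given and otherwise from the first element of sequence (in which case item starts from the second element). After the whole sequence has been traversed, the final acc is the return value of reduce().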
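The behaviour described above can be sketched as an explicit loop. The function my_reduce below is a hypothetical re-implementation written for illustration, not the actual source of functools.reduce; the sentinel _MISSING is likewise an assumption used to detect a missing initial value.

```python
from functools import reduce

_MISSING = object()   # sentinel meaning "no initial value was passed"

def my_reduce(function, sequence, initial=_MISSING):
    """Illustrative loop with the same observable behaviour as reduce."""
    it = iter(sequence)
    if initial is _MISSING:
        try:
            acc = next(it)   # the first element seeds the accumulator
        except StopIteration:
            raise TypeError('empty sequence with no initial value') from None
    else:
        acc = initial
    for item in it:
        acc = function(acc, item)   # the result feeds the next call
    return acc

def add(acc, item):
    return acc + item

print(my_reduce(add, [1, 2, 3, 4, 5]))          # 15
print(my_reduce(add, [1, 2, 3, 4, 5], 3))       # 18
print(my_reduce(add, ['1', '2', '3'], 'res'))   # res123
```

As with the real reduce, an empty sequence without an initial value raises TypeError, because there is nothing to seed the accumulator with.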

With the basics of reduce() understood, we can go back to the earlier data. As a reminder, here is the code that creates it:

>>> import collections
>>> Scientist = collections.namedtuple('Scientist', [
    'name',
    'field',
    'born',
    'nobel'
])
>>> __scientists = (
    Scientist(name = 'Ada Lovelace', field = 'math', born = 1815, nobel = False),
    Scientist(name = 'Emmy Noether', field = 'math', born = 1882, nobel = False),
    Scientist(name = 'Marie Curie', field = 'math', born = 1867, nobel = True),
    Scientist(name = 'Tu Youyou', field = 'physics', born = 1930, nobel = True),
    Scientist(name = 'Ada Yonath', field = 'chemistry', born = 1939, nobel = True),
    Scientist(name = 'Vera Rubin', field = 'astronomy', born = 1928, nobel = False),
    Scientist(name = 'Sally Ride', field = 'physics', born = 1951, nobel = False)
)

In the map() exercise we created new data holding names and ages, as follows:

>>> name_and_age = tuple({'name':x.name,'age':2020-x.born} for x in __scientists)
>>> pprint(name_and_age)
({'age': 205, 'name': 'Ada Lovelace'},
 {'age': 138, 'name': 'Emmy Noether'},
 {'age': 153, 'name': 'Marie Curie'},
 {'age': 90, 'name': 'Tu Youyou'},
 {'age': 81, 'name': 'Ada Yonath'},
 {'age': 92, 'name': 'Vera Rubin'},
 {'age': 69, 'name': 'Sally Ride'})

Now, based on name_and_age, compute the sum of everyone's ages:

>>> sum_ages = reduce(
		lambda acc,item: acc+item['age'], 
		name_and_age, 
		0)
>>> sum_ages
828

The same can be done another way:

>>> sum_ages_ = sum(x['age'] for x in name_and_age)
>>> sum_ages_
828

Of course, reduce() can do much more than this. Consider the following code:

>>> def reducer(acc,item):
	acc[item.field].append(item.name)
	return acc

>>> scientists_by_field = reduce(
	reducer,
	__scientists,
	{'math':[], 'physics':[], 'chemistry':[], 'astronomy':[]}
	)
>>> pprint(scientists_by_field)
{'astronomy': ['Vera Rubin'],
 'chemistry': ['Ada Yonath'],
 'math': ['Ada Lovelace', 'Emmy Noether', 'Marie Curie'],
 'physics': ['Tu Youyou', 'Sally Ride']}

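The code above groups the scientists by field. As the definition of reducer(acc, item) shows, the acc returned by each call is the acc passed to the next call. The initial value is {'math':[], 'physics':[], 'chemistry':[], 'astronomy':[]}; be careful not to misspell a key, since a missing key raises KeyError. Note also that reducer changes acc in place, which is convenient here but departs from the strictly immutable style.
Writing out the initial value like this is tedious. For the code above, defaultdict() can be used instead: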

>>> from collections import defaultdict
>>> scientists_by_field = reduce(
	reducer,
	__scientists,
	defaultdict(list)
	)
>>> pprint(scientists_by_field)
defaultdict(<class 'list'>,
            {'astronomy': ['Vera Rubin'],
             'chemistry': ['Ada Yonath'],
             'math': ['Ada Lovelace', 'Emmy Noether', 'Marie Curie'],
             'physics': ['Tu Youyou', 'Sally Ride']})
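Why defaultdict(list) removes the need for a hand-written initial value: looking up a missing key calls list() and stores the new empty list under that key before returning it. A minimal sketch with made-up pairs, using a plain loop rather than reduce() to keep the focus on the dictionary:

```python
from collections import defaultdict

pairs = [('math', 'Ada Lovelace'), ('physics', 'Sally Ride'), ('math', 'Emmy Noether')]

by_field = defaultdict(list)
for field, name in pairs:
    # A missing key is created on first access with an empty list as value.
    by_field[field].append(name)

print(dict(by_field))
# {'math': ['Ada Lovelace', 'Emmy Noether'], 'physics': ['Sally Ride']}

# Side effect worth knowing: merely reading a missing key also inserts it.
empty = by_field['biology']
print(empty, 'biology' in by_field)   # [] True
```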

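The same thing can be written with a lambda expression that builds a new dictionary at every step instead of changing acc. Personally I find this style hard to read; there is really no need to be Pythonic just for the sake of being Pythonic: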

>>> scientists_by_field = reduce(
	lambda acc, item: {**acc, **{item.field: acc[item.field]+[item.name]}},
	__scientists,
	{'math':[], 'physics':[], 'chemistry':[], 'astronomy':[]}
	)
>>> pprint(scientists_by_field)
{'astronomy': ['Vera Rubin'],
 'chemistry': ['Ada Yonath'],
 'math': ['Ada Lovelace', 'Emmy Noether', 'Marie Curie'],
 'physics': ['Tu Youyou', 'Sally Ride']}

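The groupby function can do something similar. groupby only groups consecutive items that share a key, so the data is sorted by field first; without the sort, the two physicists would end up in separate groups and the dictionary would keep only the last one:

>>> import itertools
>>> scientists_by_field = {
	key: list(group)
	for key, group in itertools.groupby(
		sorted(__scientists, key=lambda x: x.field),
		lambda x: x.field)
	}
>>> pprint(scientists_by_field)
{'astronomy': [Scientist(name='Vera Rubin', field='astronomy', born=1928, nobel=False)],
 'chemistry': [Scientist(name='Ada Yonath', field='chemistry', born=1939, nobel=True)],
 'math': [Scientist(name='Ada Lovelace', field='math', born=1815, nobel=False),
          Scientist(name='Emmy Noether', field='math', born=1882, nobel=False),
          Scientist(name='Marie Curie', field='math', born=1867, nobel=True)],
 'physics': [Scientist(name='Tu Youyou', field='physics', born=1930, nobel=True),
             Scientist(name='Sally Ride', field='physics', born=1951, nobel=False)]}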
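One property of itertools.groupby is worth checking on a small input: it starts a new group every time the key changes, so equal keys that are not adjacent produce separate groups. The list of field names below is made up for the demonstration:

```python
import itertools

fields = ['math', 'math', 'physics', 'chemistry', 'physics']

# Unsorted input: 'physics' shows up as two separate groups.
unsorted_groups = [(key, list(group)) for key, group in itertools.groupby(fields)]
print([key for key, _ in unsorted_groups])
# ['math', 'physics', 'chemistry', 'physics']

# Sorting first makes equal keys adjacent, so each key appears once.
sorted_groups = [(key, list(group)) for key, group in itertools.groupby(sorted(fields))]
print(sorted_groups)
# [('chemistry', ['chemistry']), ('math', ['math', 'math']), ('physics', ['physics', 'physics'])]
```

When the groups are collected into a dictionary, a key that occurs twice is overwritten by its later group, which is why sorting by the grouping key beforehand matters.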

Recap:

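This part introduced the three basic functions of functional programming. As the examples show, what each of them does can usually also be written as a list comprehension or a generator expression. The difference is that the functional form can be faster in some cases, and that applying a function to each item independently carries over directly to parallel execution. The following gives a brief introduction to parallel computing in this style.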

Parallel computing

First, some code:

#in parallel.py
import collections
from pprint import pprint

Scientist = collections.namedtuple('Scientist', [
    'name',
    'field',
    'born',
    'nobel'
])

__scientists = (
    Scientist(name = 'Ada Lovelace', field = 'math', born = 1815, nobel = False),
    Scientist(name = 'Emmy Noether', field = 'math', born = 1882, nobel = False),
    Scientist(name = 'Marie Curie', field = 'math', born = 1867, nobel = True),
    Scientist(name = 'Tu Youyou', field = 'physics', born = 1930, nobel = True),
    Scientist(name = 'Ada Yonath', field = 'chemistry', born = 1939, nobel = True),
    Scientist(name = 'Vera Rubin', field = 'astronomy', born = 1928, nobel = False),
    Scientist(name = 'Sally Ride', field = 'physics', born = 1951, nobel = False)
)

pprint(__scientists)

def transform(item):
    return {'name':item.name,'age':2020 - item.born}

res = tuple(map(
    transform,
    __scientists
))

pprint(res)

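If you have followed along so far, this code should be easy to understand: it creates the immutable __scientists data and uses map() to produce new data containing only names and ages.
Its output: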

(Scientist(name='Ada Lovelace', field='math', born=1815, nobel=False),
 Scientist(name='Emmy Noether', field='math', born=1882, nobel=False),
 Scientist(name='Marie Curie', field='math', born=1867, nobel=True),
 Scientist(name='Tu Youyou', field='physics', born=1930, nobel=True),
 Scientist(name='Ada Yonath', field='chemistry', born=1939, nobel=True),
 Scientist(name='Vera Rubin', field='astronomy', born=1928, nobel=False),
 Scientist(name='Sally Ride', field='physics', born=1951, nobel=False))
({'age': 205, 'name': 'Ada Lovelace'},
 {'age': 138, 'name': 'Emmy Noether'},
 {'age': 153, 'name': 'Marie Curie'},
 {'age': 90, 'name': 'Tu Youyou'},
 {'age': 81, 'name': 'Ada Yonath'},
 {'age': 92, 'name': 'Vera Rubin'},
 {'age': 69, 'name': 'Sally Ride'})

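You can try running this code yourself; the result is printed almost immediately. This is an idealized case, though. Real problems are messier: you may have to process large amounts of data, or fetch data from web pages one request after another, and the processing time grows accordingly. Let's simulate that with the same code.
Modify transform() and time the run with the time module as follows: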

import collections
import time
from pprint import pprint

Scientist = collections.namedtuple('Scientist', [
    'name',
    'field',
    'born',
    'nobel'
])

__scientists = (
    Scientist(name = 'Ada Lovelace', field = 'math', born = 1815, nobel = False),
    Scientist(name = 'Emmy Noether', field = 'math', born = 1882, nobel = False),
    Scientist(name = 'Marie Curie', field = 'math', born = 1867, nobel = True),
    Scientist(name = 'Tu Youyou', field = 'physics', born = 1930, nobel = True),
    Scientist(name = 'Ada Yonath', field = 'chemistry', born = 1939, nobel = True),
    Scientist(name = 'Vera Rubin', field = 'astronomy', born = 1928, nobel = False),
    Scientist(name = 'Sally Ride', field = 'physics', born = 1951, nobel = False)
)
# pprint(__scientists)
def transform(item):
    print(f'processing record {item.name}')
    time.sleep(1)
    res = {'name':item.name,'age':2020 - item.born}
    print(f'Done processing record {item.name}')
    return res

start = time.time()
res = tuple(map(
    transform,
    __scientists
))
end = time.time()

print(f'Time to complete: {end - start}')
# pprint(res)

Running it again gives:

processing record Ada Lovelace
Done processing record Ada Lovelace
processing record Emmy Noether
Done processing record Emmy Noether
processing record Marie Curie
Done processing record Marie Curie
processing record Tu Youyou
Done processing record Tu Youyou
processing record Ada Yonath
Done processing record Ada Yonath
processing record Vera Rubin
Done processing record Vera Rubin
processing record Sally Ride
Done processing record Sally Ride
Time to complete: 7.009661436080933

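The output shows the records being processed one at a time, about 7 seconds in total. With more data it would take correspondingly longer. This is where the functional style pays off: since transform() handles each record independently and changes no shared state, the map can be spread over several processes to make better use of the CPU.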

import collections
import time
from pprint import pprint

import multiprocessing


Scientist = collections.namedtuple('Scientist', [
    'name',
    'field',
    'born',
    'nobel'
])

__scientists = (
    Scientist(name = 'Ada Lovelace', field = 'math', born = 1815, nobel = False),
    Scientist(name = 'Emmy Noether', field = 'math', born = 1882, nobel = False),
    Scientist(name = 'Marie Curie', field = 'math', born = 1867, nobel = True),
    Scientist(name = 'Tu Youyou', field = 'physics', born = 1930, nobel = True),
    Scientist(name = 'Ada Yonath', field = 'chemistry', born = 1939, nobel = True),
    Scientist(name = 'Vera Rubin', field = 'astronomy', born = 1928, nobel = False),
    Scientist(name = 'Sally Ride', field = 'physics', born = 1951, nobel = False)
)

# pprint(__scientists)

def transform(item):
    print(f'processing record {item.name}')
    time.sleep(1)
    res = {'name':item.name,'age':2020 - item.born}
    print(f'Done processing record {item.name}')
    return res
    
if __name__ == '__main__':
    start = time.time()
    pool = multiprocessing.Pool()
    res = pool.map(transform,__scientists)
    end = time.time()
    print(f'Time to complete: {end - start}')

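Note the guard below. On platforms that start worker processes with the spawn method (the default on Windows), each worker imports the main module, so the code that creates the pool must sit behind this check; otherwise every worker would try to create a pool of its own: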

if __name__ == '__main__':

Output:

processing record Ada Lovelace
processing record Emmy Noether
processing record Marie Curie
processing record Tu Youyou
processing record Ada Yonath
processing record Vera Rubin
processing record Sally Ride
Done processing record Ada Lovelace
Done processing record Emmy Noether
Done processing record Marie Curie
Done processing record Tu Youyou
Done processing record Ada Yonath
Done processing record Vera Rubin
Done processing record Sally Ride
Time to complete: 1.498938798904419

The result clearly shows the gain from parallel processing: the records are no longer handled one after another but several at a time.

Exploring further:

Change transform() as follows so that it prints the process id while running:

import os

def transform(item):
    print(f'Process {os.getpid()} working record {item.name}')
    time.sleep(1)
    res = {'name':item.name,'age':2020 - item.born}
    print(f'Process {os.getpid()} done processing record {item.name}')
    return res

Output:

Process 15812 working record Ada Lovelace
Process 22092 working record Emmy Noether
Process 12756 working record Marie Curie
Process 20052 working record Tu Youyou
Process 7324 working record Ada Yonath
Process 30060 working record Vera Rubin
Process 23028 working record Sally Ride
Process 15812 done processing record Ada Lovelace
Process 22092 done processing record Emmy Noether
Process 12756 done processing record Marie Curie
Process 20052 done processing record Tu Youyou
Process 7324 done processing record Ada Yonath
Process 30060 done processing record Vera Rubin
Process 23028 done processing record Sally Ride
Time to complete: 1.446131706237793

The number of worker processes can be set explicitly by passing an argument to Pool():

pool = multiprocessing.Pool(processes=2)

Output:

Process 14684 working record Ada Lovelace
Process 31848 working record Emmy Noether
Process 14684 done processing record Ada Lovelace
Process 14684 working record Marie Curie
Process 31848 done processing record Emmy Noether
Process 31848 working record Tu Youyou
Process 14684 done processing record Marie Curie
Process 14684 working record Ada Yonath
Process 31848 done processing record Tu Youyou
Process 31848 working record Vera Rubin
Process 14684 done processing record Ada Yonath
Process 14684 working record Sally Ride
Process 31848 done processing record Vera Rubin
Process 14684 done processing record Sally Ride
Time to complete: 4.218588829040527

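The output shows two processes handling the data, which is why the seven one-second records take about four seconds. Without a limit, Pool() defaults to one worker process per CPU reported by the system.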

concurrent.futures

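In Python 3 (since version 3.2), the concurrent.futures module offers a higher-level interface for the same kind of parallel computation. The code: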

import concurrent.futures

if __name__ == '__main__':
    start = time.time()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        res = executor.map(transform,__scientists)
    # pool = multiprocessing.Pool(processes=2)
    # res = pool.map(transform,__scientists)
    end = time.time()
    print(f'Time to complete: {end - start}')
    pprint(tuple(res))

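This code is parallel as well. Unlike the multiprocessing version above, the executor is used here with a with statement as a context manager, which waits for the workers to finish and shuts the pool down on exit. The with statement and context managers are not the focus here. Replacing the parallel part of the original code with the snippet above gives: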

Process 13996 working record Ada Lovelace
Process 18116 working record Emmy Noether
Process 32556 working record Marie Curie
Process 10320 working record Tu Youyou
Process 31808 working record Ada Yonath
Process 9552 working record Vera Rubin
Process 30600 working record Sally Ride
Process 13996 done processing record Ada Lovelace
Process 18116 done processing record Emmy Noether
Process 32556 done processing record Marie Curie
Process 10320 done processing record Tu Youyou
Process 31808 done processing record Ada Yonath
Process 9552 done processing record Vera Rubin
Process 30600 done processing record Sally Ride
Time to complete: 1.6321463584899902
({'age': 205, 'name': 'Ada Lovelace'},
 {'age': 138, 'name': 'Emmy Noether'},
 {'age': 153, 'name': 'Marie Curie'},
 {'age': 90, 'name': 'Tu Youyou'},
 {'age': 81, 'name': 'Ada Yonath'},
 {'age': 92, 'name': 'Vera Rubin'},
 {'age': 69, 'name': 'Sally Ride'})

One more thing:

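concurrent.futures can also run work on threads. Because of Python's global interpreter lock (GIL), however, only one thread executes Python code at any given moment, so setting up several threads does not speed up code that is busy computing. I won't go into depth here and will write about it separately. As a rule of thumb, use process-based parallelism for CPU-bound work in Python. I/O is the exception: while a thread waits on an I/O operation and the CPU would otherwise sit idle, the GIL is released, so other threads can run.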
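The I/O case can be sketched with ThreadPoolExecutor from concurrent.futures. The function fake_io is a hypothetical stand-in for a real network or disk call; it only sleeps, and time.sleep releases the GIL while waiting, so the five calls overlap.

```python
import concurrent.futures
import time

def fake_io(n):
    """Stand-in for an I/O call; the thread waits without holding the GIL."""
    time.sleep(0.2)
    return n * 2

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # executor.map yields results in the order of the inputs.
    results = list(executor.map(fake_io, range(5)))
elapsed = time.time() - start

print(results)                        # [0, 2, 4, 6, 8]
print(f'Time to complete: {elapsed:.2f}')
```

Run one after another, the five calls would need about one second; with five threads the total stays close to a single 0.2-second wait. A function that computes instead of waiting would not see this gain from threads.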

Summary:

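This article covered what functional programming is, its characteristics, and its three basic functions, and used those functions to take a first look at parallel computing. As a programming style, functional programming offers a different way of thinking about code, and it lends itself well to parallel computation.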

And finally:

Your likes and comments are my greatest encouragement.
Discussion is welcome. Thanks!
