Python distributed computing with dispy

Introduction to dispy
Although dispy's implementation is fairly complex under the hood, its careful encapsulation and design make it a tool that is easy to pick up and to use in depth. dispy works both for parallel computation on a single multi-processor (SMP) machine and for parallel computation across a cluster of machines. dispy does not provide a communication mechanism between subtasks; if that is needed, the asyncoro library can be used directly for communication and synchronization.
dispy supports Python 2.7+ and Python 3.1+ and can be installed directly with pip: sudo pip install dispy.

Using dispy
dispy's minimalism shows most clearly in how flat its structure is: it consists of only four components, dispy.py, dispynode.py, dispyscheduler.py and dispynetrelay.py.
Take a cluster as an example: every host can be regarded as a node, and whatever submits jobs can be regarded as a client. Clients submit computation jobs, while nodes do the actual computing. There are two kinds of client, JobCluster and SharedJobCluster, and their functionality is essentially the same; the difference is that a host may hold only one JobCluster instance, whereas several SharedJobCluster instances can coexist. dispynode.py and dispyscheduler.py serve the node side for these two kinds of client, respectively.
dispynetrelay exists to support distributed computation that spans different networks.
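To make these roles concrete, here is a minimal sketch of the client side, essentially the canonical dispy hello-world. It assumes dispy is installed and at least one dispynode is already running on the local network; compute is serialized, sent to a node and executed there (JobCluster is described in detail below).

# minimal client sketch: submit a few jobs and collect their results
def compute(n):
    # runs on a dispynode; imports must happen inside the job
    import time, socket
    time.sleep(n)
    return (socket.gethostname(), n)

if __name__ == '__main__':
    import dispy, random
    cluster = dispy.JobCluster(compute)              # client; discovers running nodes
    jobs = []
    for i in range(5):
        job = cluster.submit(random.randint(3, 10))  # schedule compute(n) on some node
        job.id = i                                   # optional, for bookkeeping
        jobs.append(job)
    for job in jobs:
        host, n = job()                              # wait for this job and get its result
        print('job %s slept %s seconds on %s' % (job.id, n, host))
    cluster.print_status()
    cluster.close()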

Creating a node
Creating a node requires no code at all: once dispy is installed, running dispynode.py on the command line starts a node (--daemon runs it as a daemon process). Once the node is up, jobs can be submitted to it for computation. Options can be passed when starting the node as well, such as the maximum number of local CPUs to use, the maximum size of files exchanged with clients, the ports to listen on, and so on.
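For instance (a sketch: --daemon comes from the note above; the full, version-specific option list is printed by dispynode.py --help):

# start a node on this machine and keep it running in the background
dispynode.py --daemon

# dispynode.py --help lists the remaining options (CPU limit, ports,
# maximum size of transferred files, and so on)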

Creating a client and submitting jobs
As mentioned above, dispy offers two kinds of client: JobCluster and SharedJobCluster. JobCluster is defined as follows:
class dispy.JobCluster(computation, nodes=['*'], depends=[], callback=None, cluster_status=None, ip_addr=None, ext_ip_addr=None, port=51347, node_port=51348, recover_file=None, dest_path=None, loglevel=logging.WARNING, setup=None, cleanup=True, pulse_interval=None, ping_interval=None, reentrant=False, secret='', keyfile=None, certfile=None)

Key parameters:
computation: the computation itself; it can be a function (called directly) or a standalone program (given as a string);
nodes: IP addresses or host names of the cluster nodes to use;
depends: modules, files or classes the computation depends on; if a node lacks a module that exists in the client's environment, it can be shared with the node this way;
callback: a callback function defining what to do when a submitted job returns an intermediate value or finishes (a sketch follows this list);
cluster_status: a user-supplied function executed whenever a node's status changes;
ip_addr, ext_ip_addr: NAT-related; used to specify the external IP address;
port, node_port: ports used to communicate with other clients and nodes;
recover_file: name of the file that records the client's running state; it can be used to recover the client;
dest_path: the path under which transferred files are stored;
loglevel: logging level;
setup: an initialization function each node runs before executing jobs;
cleanup: whether to remove the files generated by distributed jobs;
pulse_interval, ping_interval, reentrant: heartbeat/keep-alive settings; reentrant sets the policy for a node that has died: with reentrant=True its jobs are handed over to other nodes, with reentrant=False the jobs on the dead node are cancelled;
secret, keyfile, certfile: parameters for authentication and SSL-secured communication.
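As a small illustration of the callback parameter above, the sketch below (not from the original text; it assumes a running dispynode) prints each job's result as soon as dispy reports it finished:

# sketch: per-job callback invoked by the dispy client when a job's status changes
def compute(n):
    return n * n

def job_callback(job):                    # called with the DispyJob that changed status
    if job.status == dispy.DispyJob.Finished:
        print('job finished with result %s' % job.result)
    elif job.status in (dispy.DispyJob.Terminated, dispy.DispyJob.Cancelled):
        print('job failed: %s' % job.exception)

if __name__ == '__main__':
    import dispy
    cluster = dispy.JobCluster(compute, callback=job_callback)
    jobs = [cluster.submit(n) for n in range(5)]
    cluster.wait()                        # block until all submitted jobs are done
    cluster.print_status()
    cluster.close()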

SharedJobCluster is almost identical to JobCluster and is not described again here.
class dispy.SharedJobCluster(computation, nodes=['*'], depends=[], ip_addr=None, port=51347, scheduler_node=None, scheduler_port=None, ext_ip_addr=None, dest_path=None, loglevel=logging.WARNING, cleanup=True, reentrant=False, exclusive=False, secret='', keyfile=None, certfile=None)
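The practical difference is that a SharedJobCluster does not schedule jobs itself but hands them to a shared scheduler, so dispyscheduler.py must already be running somewhere; the host running it is passed as scheduler_node. A hedged sketch (the host name scheduler_host is a placeholder):

# sketch: submitting through a shared scheduler instead of a private one
def compute(n):
    return n * n

if __name__ == '__main__':
    import dispy
    # 'scheduler_host' is hypothetical: the machine running dispyscheduler.py
    cluster = dispy.SharedJobCluster(compute, scheduler_node='scheduler_host')
    jobs = [cluster.submit(n) for n in range(5)]
    for job in jobs:
        print(job())                      # wait for and print each result
    cluster.close()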

A cluster created this way submits jobs through its submit method, which returns a job object whose id, status (Created, Running, Finished, Cancelled or Terminated) and ip_addr can then be read. Each submitted unit of work is a Job (DispyJob); once finished, it carries the final result together with stdout, stderr, exception, start_time and end_time.
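A short sketch of that job object (assuming a local dispynode is running); note that anything the computation prints comes back in the job's stdout:

# sketch: inspecting the DispyJob returned by submit()
def compute(n):
    print('working on %s' % n)            # captured and returned as job.stdout
    return n * n

if __name__ == '__main__':
    import dispy
    cluster = dispy.JobCluster(compute)
    job = cluster.submit(7)
    job.id = 'square-of-7'                # id is free for the caller to use
    result = job()                        # block until the job finishes (== job.result)
    if job.status == dispy.DispyJob.Finished:
        print(job.id, result, job.stdout, job.start_time, job.end_time)
    else:
        print(job.exception)
    cluster.close()

The longer examples that follow exercise the same interface together with the setup, cleanup, cluster_status and reentrant options described above.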

The example below uses setup and cleanup to load a data file into memory once per node and then computes a different hash digest of it in each job:

# executed on each node before any jobs are scheduled
def setup(data_file):
    # read data in file to global variable
    global data, algorithms
    # if running under Windows, modules can't be global, as they are not
    # serializable; instead, they must be loaded in 'compute' (jobs); under
    # Posix (Linux, OS X and other Unix variants), modules declared global in
    # 'setup' will be available in 'compute'
    # 'os' module is already available (loaded by dispynode)
    if os.name != 'nt':  # if not Windows, load hashlib module in global scope
        global hashlib
        import hashlib
    data = open(data_file).read()  # read file into memory; data_file can now be deleted
    if sys.version_info.major > 2:
        data = data.encode()  # convert to bytes
        algorithms = list(hashlib.algorithms_guaranteed)
    else:
        algorithms = hashlib.algorithms
    return 0  # successful initialization should return 0

def cleanup():
    global data
    del data
    if os.name != 'nt':
        global hashlib
        del hashlib

def compute(n):
    if os.name == 'nt':  # under Windows modules must be loaded in jobs
        import hashlib
    # 'data' and 'algorithms' global variables are initialized in 'setup'
    alg = algorithms[n % len(algorithms)]
    csum = getattr(hashlib, alg)()
    csum.update(data)
    return (alg, csum.hexdigest())

if __name__ == '__main__':
    import dispy, sys, functools
    # if no data file name is given, use this file as data file
    data_file = sys.argv[1] if len(sys.argv) > 1 else sys.argv[0]
    cluster = dispy.JobCluster(compute, depends=[data_file],
                               setup=functools.partial(setup, data_file),
                               cleanup=cleanup)
    jobs = []
    for n in range(10):
        job = cluster.submit(n)
        job.id = n
        jobs.append(job)
    for job in jobs:
        job()
        if job.status == dispy.DispyJob.Finished:
            print('%s: %s : %s' % (job.id, job.result[0], job.result[1]))
        else:
            print(job.exception)
    cluster.print_status()
    cluster.close()
# job computation runs at dispynode servers
def compute(path):
    import hashlib, time, os
    csum = hashlib.sha1()
    with open(os.path.basename(path), 'rb') as fd:
        while True:
            data = fd.read(1024000)
            if not data:
                break
            csum.update(data)
    time.sleep(5)
    return csum.hexdigest()

# 'cluster_status' callback function. It is called by dispy (client)
# to indicate node / job status changes. Here node initialization and
# job done status are used to schedule jobs, so at most one job is
# running on a node (even if a node has more than one processor). Data
# files are assumed to be 'data000', 'data001' etc.
def status_cb(status, node, job):
    if status == dispy.DispyJob.Finished:
        print('sha1sum for %s: %s' % (job.id, job.result))
    elif status == dispy.DispyJob.Terminated:
        print('sha1sum for %s failed: %s' % (job.id, job.exception))
    elif status == dispy.DispyNode.Initialized:
        print('node %s with %s CPUs available' % (node.ip_addr, node.avail_cpus))
    else:  # ignore other status messages
        return

    global submitted
    data_file = 'data%03d' % submitted
    if os.path.isfile(data_file):
        submitted += 1
        # 'node' and 'dispy_job_depends' are consumed by dispy;
        # 'compute' is called with only 'data_file' as argument(s)
        job = cluster.submit_node(node, data_file, dispy_job_depends=[data_file])
        job.id = data_file

if __name__ == '__main__':
    import dispy, sys, os
    cluster = dispy.JobCluster(compute, cluster_status=status_cb)
    submitted = 0
    while True:
        try:
            cmd = sys.stdin.readline().strip().lower()
        except KeyboardInterrupt:
            break
        if cmd == 'quit' or cmd == 'exit':
            break
    cluster.wait()
    cluster.print_status()
# a version of the word frequency example from the mapreduce tutorial
def mapper(doc):
    # input reader and map function are combined
    import os
    words = []
    with open(os.path.join('/tmp', doc)) as fd:
        for line in fd:
            words.extend((word.lower(), 1) for word in line.split()
                         if len(word) > 3 and word.isalpha())
    return words

def reducer(words):
    # we should generate sorted lists which are then merged,
    # but to keep things simple, we use dicts
    word_count = {}
    for word, count in words:
        if word not in word_count:
            word_count[word] = 0
        word_count[word] += count
    # print('reducer: %s to %s' % (len(words), len(word_count)))
    return word_count

if __name__ == '__main__':
    import dispy, logging
    # assume nodes node1 and node2 have 'doc1', 'doc2' etc. on their
    # local storage, so no need to transfer them
    map_cluster = dispy.JobCluster(mapper, nodes=['node1', 'node2'], reentrant=True)
    # any node can work on reduce
    reduce_cluster = dispy.JobCluster(reducer, nodes=['*'], reentrant=True)
    map_jobs = []
    for f in ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']:
        job = map_cluster.submit(f)
        map_jobs.append(job)
    reduce_jobs = []
    for map_job in map_jobs:
        words = map_job()
        if not words:
            print(map_job.exception)
            continue
        # simple partition
        n = 0
        while n < len(words):
            m = min(len(words) - n, 1000)
            reduce_job = reduce_cluster.submit(words[n:n+m])
            reduce_jobs.append(reduce_job)
            n += m
    # reduce
    word_count = {}
    for reduce_job in reduce_jobs:
        words = reduce_job()
        if not words:
            print(reduce_job.exception)
            continue
        for word, count in words.items():   # items() works in Python 2 and 3
            if word not in word_count:
                word_count[word] = 0
            word_count[word] += count
    # sort words by frequency and print
    for word in sorted(word_count, key=lambda x: word_count[x], reverse=True):
        count = word_count[word]
        print(word, count)
    reduce_cluster.print_status()