Python distributed computing with dispy

Introduction to dispy
Although dispy's implementation is fairly complex under the hood, its careful encapsulation and design make it a tool that is easy to pick up and to use in depth. dispy works both for parallel computation on a single multi-processor (SMP) machine and for parallel computation across a cluster of machines. dispy does not provide a communication mechanism between subtasks; if that is needed, the asyncoro library can be used directly for communication and synchronization.
dispy supports Python 2.7+ and Python 3.1+ and can be installed directly with pip: sudo pip install dispy.

Using dispy
dispy's minimalism shows most clearly in how flat its structure is: it consists of only four components, dispy.py, dispynode.py, dispyscheduler.py and dispynetrelay.py.
Take a cluster as an example: every host can be regarded as a node, and whatever submits jobs can be regarded as a client. Clients submit computation jobs, while nodes do the actual computing. There are two kinds of client, JobCluster and SharedJobCluster, and their functionality is essentially the same; the difference is that a host may hold only one JobCluster instance, whereas several SharedJobCluster instances can coexist. dispynode.py and dispyscheduler.py serve the node side for these two kinds of client, respectively.
dispynetrelay exists to support distributed computation that spans different networks.
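To make these roles concrete, here is a minimal sketch of the client side, essentially the canonical dispy hello-world. It assumes dispy is installed and at least one dispynode is already running on the local network; compute is serialized, sent to a node and executed there (JobCluster is described in detail below).

# minimal client sketch: submit a few jobs and collect their results
def compute(n):
    # runs on a dispynode; imports must happen inside the job
    import time, socket
    time.sleep(n)
    return (socket.gethostname(), n)

if __name__ == '__main__':
    import dispy, random
    cluster = dispy.JobCluster(compute)              # client; discovers running nodes
    jobs = []
    for i in range(5):
        job = cluster.submit(random.randint(3, 10))  # schedule compute(n) on some node
        job.id = i                                   # optional, for bookkeeping
        jobs.append(job)
    for job in jobs:
        host, n = job()                              # wait for this job and get its result
        print('job %s slept %s seconds on %s' % (job.id, n, host))
    cluster.print_status()
    cluster.close()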

Creating a node
Creating a node requires no code at all: once dispy is installed, running dispynode.py on the command line starts a node (--daemon runs it as a daemon process). Once the node is up, jobs can be submitted to it for computation. Options can be passed when starting the node as well, such as the maximum number of local CPUs to use, the maximum size of files exchanged with clients, the ports to listen on, and so on.
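For instance (a sketch: --daemon comes from the note above; the full, version-specific option list is printed by dispynode.py --help):

# start a node on this machine and keep it running in the background
dispynode.py --daemon

# dispynode.py --help lists the remaining options (CPU limit, ports,
# maximum size of transferred files, and so on)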

Creating a client and submitting jobs
As mentioned above, dispy offers two kinds of client: JobCluster and SharedJobCluster. JobCluster is defined as follows:
class dispy.JobCluster(computation, nodes=['*'], depends=[], callback=None, cluster_status=None, ip_addr=None, ext_ip_addr=None, port=51347, node_port=51348, recover_file=None, dest_path=None, loglevel=logging.WARNING, setup=None, cleanup=True, pulse_interval=None, ping_interval=None, reentrant=False, secret='', keyfile=None, certfile=None)

Key parameters:
computation: the computation itself; it can be a function (called directly) or a standalone program (given as a string);
nodes: IP addresses or host names of the cluster nodes to use;
depends: modules, files or classes the computation depends on; if a node lacks a module that exists in the client's environment, it can be shared with the node this way;
callback: a callback function defining what to do when a submitted job returns an intermediate value or finishes (a sketch follows this list);
cluster_status: a user-supplied function executed whenever a node's status changes;
ip_addr, ext_ip_addr: NAT-related; used to specify the external IP address;
port, node_port: ports used to communicate with other clients and nodes;
recover_file: name of the file that records the client's running state; it can be used to recover the client;
dest_path: the path under which transferred files are stored;
loglevel: logging level;
setup: an initialization function each node runs before executing jobs;
cleanup: whether to remove the files generated by distributed jobs;
pulse_interval, ping_interval, reentrant: heartbeat/keep-alive settings; reentrant sets the policy for a node that has died: with reentrant=True its jobs are handed over to other nodes, with reentrant=False the jobs on the dead node are cancelled;
secret, keyfile, certfile: parameters for authentication and SSL-secured communication.
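As a small illustration of the callback parameter above, the sketch below (not from the original text; it assumes a running dispynode) prints each job's result as soon as dispy reports it finished:

# sketch: per-job callback invoked by the dispy client when a job's status changes
def compute(n):
    return n * n

def job_callback(job):                    # called with the DispyJob that changed status
    if job.status == dispy.DispyJob.Finished:
        print('job finished with result %s' % job.result)
    elif job.status in (dispy.DispyJob.Terminated, dispy.DispyJob.Cancelled):
        print('job failed: %s' % job.exception)

if __name__ == '__main__':
    import dispy
    cluster = dispy.JobCluster(compute, callback=job_callback)
    jobs = [cluster.submit(n) for n in range(5)]
    cluster.wait()                        # block until all submitted jobs are done
    cluster.print_status()
    cluster.close()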

SharedJobCluster is almost identical to JobCluster and is not described again here.
class dispy.SharedJobCluster(computation, nodes=['*'], depends=[], ip_addr=None, port=51347, scheduler_node=None, scheduler_port=None, ext_ip_addr=None, dest_path=None, loglevel=logging.WARNING, cleanup=True, reentrant=False, exclusive=False, secret='', keyfile=None, certfile=None)
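The practical difference is that a SharedJobCluster does not schedule jobs itself but hands them to a shared scheduler, so dispyscheduler.py must already be running somewhere; the host running it is passed as scheduler_node. A hedged sketch (the host name scheduler_host is a placeholder):

# sketch: submitting through a shared scheduler instead of a private one
def compute(n):
    return n * n

if __name__ == '__main__':
    import dispy
    # 'scheduler_host' is hypothetical: the machine running dispyscheduler.py
    cluster = dispy.SharedJobCluster(compute, scheduler_node='scheduler_host')
    jobs = [cluster.submit(n) for n in range(5)]
    for job in jobs:
        print(job())                      # wait for and print each result
    cluster.close()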

A cluster created this way submits jobs through its submit method, which returns a job object whose id, status (Created, Running, Finished, Cancelled or Terminated) and ip_addr can then be read. Each submitted unit of work is a Job (DispyJob); once finished, it carries the final result together with stdout, stderr, exception, start_time and end_time.
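A short sketch of that job object (assuming a local dispynode is running); note that anything the computation prints comes back in the job's stdout:

# sketch: inspecting the DispyJob returned by submit()
def compute(n):
    print('working on %s' % n)            # captured and returned as job.stdout
    return n * n

if __name__ == '__main__':
    import dispy
    cluster = dispy.JobCluster(compute)
    job = cluster.submit(7)
    job.id = 'square-of-7'                # id is free for the caller to use
    result = job()                        # block until the job finishes (== job.result)
    if job.status == dispy.DispyJob.Finished:
        print(job.id, result, job.stdout, job.start_time, job.end_time)
    else:
        print(job.exception)
    cluster.close()

The longer examples that follow exercise the same interface together with the setup, cleanup, cluster_status and reentrant options described above.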

The example below uses setup and cleanup to load a data file into memory once per node and then computes a different hash digest of it in each job:

# executed on each node before any jobs are scheduled
def setup(data_file):
    # read data in file to global variable
    global data, algorithms
    # if running under Windows, modules can't be global, as they are not
    # serializable; instead, they must be loaded in 'compute' (jobs); under
    # Posix (Linux, OS X and other Unix variants), modules declared global in
    # 'setup' will be available in 'compute'
    # 'os' module is already available (loaded by dispynode)
    if os.name != 'nt':  # if not Windows, load hashlib module in global scope
        global hashlib
        import hashlib
    data = open(data_file).read()  # read file into memory; data_file can now be deleted
    if sys.version_info.major > 2:
        data = data.encode()  # convert to bytes
        algorithms = list(hashlib.algorithms_guaranteed)
    else:
        algorithms = hashlib.algorithms
    return 0  # successful initialization should return 0

def cleanup():
    global data
    del data
    if os.name != 'nt':
        global hashlib
        del hashlib

def compute(n):
    if os.name == 'nt':  # under Windows modules must be loaded in jobs
        import hashlib
    # 'data' and 'algorithms' global variables are initialized in 'setup'
    alg = algorithms[n % len(algorithms)]
    csum = getattr(hashlib, alg)()
    csum.update(data)
    return (alg, csum.hexdigest())

if __name__ == '__main__':
    import dispy, sys, functools
    # if no data file name is given, use this file as data file
    data_file = sys.argv[1] if len(sys.argv) > 1 else sys.argv[0]
    cluster = dispy.JobCluster(compute, depends=[data_file],
                               setup=functools.partial(setup, data_file),
                               cleanup=cleanup)
    jobs = []
    for n in range(10):
        job = cluster.submit(n)
        job.id = n
        jobs.append(job)
    for job in jobs:
        job()
        if job.status == dispy.DispyJob.Finished:
            print('%s: %s : %s' % (job.id, job.result[0], job.result[1]))
        else:
            print(job.exception)
    cluster.print_status()
    cluster.close()
# job computation runs at dispynode servers
def compute(path):
    import hashlib, time, os
    csum = hashlib.sha1()
    with open(os.path.basename(path), 'rb') as fd:
        while True:
            data = fd.read(1024000)
            if not data:
                break
            csum.update(data)
    time.sleep(5)
    return csum.hexdigest()

# 'cluster_status' callback function. It is called by dispy (client)
# to indicate node / job status changes. Here node initialization and
# job done status are used to schedule jobs, so at most one job is
# running on a node (even if a node has more than one processor). Data
# files are assumed to be 'data000', 'data001' etc.
def status_cb(status, node, job):
    if status == dispy.DispyJob.Finished:
        print('sha1sum for %s: %s' % (job.id, job.result))
    elif status == dispy.DispyJob.Terminated:
        print('sha1sum for %s failed: %s' % (job.id, job.exception))
    elif status == dispy.DispyNode.Initialized:
        print('node %s with %s CPUs available' % (node.ip_addr, node.avail_cpus))
    else:  # ignore other status messages
        return

    global submitted
    data_file = 'data%03d' % submitted
    if os.path.isfile(data_file):
        submitted += 1
        # 'node' and 'dispy_job_depends' are consumed by dispy;
        # 'compute' is called with only 'data_file' as argument(s)
        job = cluster.submit_node(node, data_file, dispy_job_depends=[data_file])
        job.id = data_file

if __name__ == '__main__':
    import dispy, sys, os
    cluster = dispy.JobCluster(compute, cluster_status=status_cb)
    submitted = 0
    while True:
        try:
            cmd = sys.stdin.readline().strip().lower()
        except KeyboardInterrupt:
            break
        if cmd == 'quit' or cmd == 'exit':
            break
    cluster.wait()
    cluster.print_status()
# a version of the word frequency example from the mapreduce tutorial
def mapper(doc):
    # input reader and map function are combined
    import os
    words = []
    with open(os.path.join('/tmp', doc)) as fd:
        for line in fd:
            words.extend((word.lower(), 1) for word in line.split()
                         if len(word) > 3 and word.isalpha())
    return words

def reducer(words):
    # we should generate sorted lists which are then merged,
    # but to keep things simple, we use dicts
    word_count = {}
    for word, count in words:
        if word not in word_count:
            word_count[word] = 0
        word_count[word] += count
    # print('reducer: %s to %s' % (len(words), len(word_count)))
    return word_count

if __name__ == '__main__':
    import dispy, logging
    # assume nodes node1 and node2 have 'doc1', 'doc2' etc. on their
    # local storage, so no need to transfer them
    map_cluster = dispy.JobCluster(mapper, nodes=['node1', 'node2'], reentrant=True)
    # any node can work on reduce
    reduce_cluster = dispy.JobCluster(reducer, nodes=['*'], reentrant=True)
    map_jobs = []
    for f in ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']:
        job = map_cluster.submit(f)
        map_jobs.append(job)
    reduce_jobs = []
    for map_job in map_jobs:
        words = map_job()
        if not words:
            print(map_job.exception)
            continue
        # simple partition
        n = 0
        while n < len(words):
            m = min(len(words) - n, 1000)
            reduce_job = reduce_cluster.submit(words[n:n+m])
            reduce_jobs.append(reduce_job)
            n += m
    # reduce
    word_count = {}
    for reduce_job in reduce_jobs:
        words = reduce_job()
        if not words:
            print(reduce_job.exception)
            continue
        for word, count in words.items():   # items() works in Python 2 and 3
            if word not in word_count:
                word_count[word] = 0
            word_count[word] += count
    # sort words by frequency and print
    for word in sorted(word_count, key=lambda x: word_count[x], reverse=True):
        count = word_count[word]
        print(word, count)
    reduce_cluster.print_status()