A socket-based multi-process distributed computing demo in Python

Preface

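While studying multiprocessing.managers, I wrote a small socket-based distributed computing demo. The master generates the integers 0 to 19 (np.arange(0, 20)) and puts them into the task queue. Slaves anywhere on the cluster network take numbers from the task queue, sum them, and put each sum into the result queue. The master then prints the values it finds in the result queue.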
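The flow described above can be tried without any networking. The following is a minimal sketch, assuming a single worker and two local queue.Queue objects in one process; the function name run_pipeline and the batch size of 3 are illustrative choices, not part of the demo's code.

```python
import queue


def run_pipeline(n_tasks=20, batch_size=3):
    """Simulate the master/worker flow with two local queues (no network)."""
    task_queue = queue.Queue()
    result_queue = queue.Queue()

    # Master: enqueue the integers 0 .. n_tasks - 1.
    for value in range(n_tasks):
        task_queue.put(value)

    # Worker: drain the task queue in batches and publish each batch sum.
    batch = []
    while True:
        try:
            batch.append(task_queue.get_nowait())
        except queue.Empty:
            break
        if len(batch) == batch_size:
            result_queue.put(sum(batch))
            batch = []
    if batch:  # flush the final, possibly shorter batch
        result_queue.put(sum(batch))

    # Master: collect every partial sum.
    results = []
    while not result_queue.empty():
        results.append(result_queue.get())
    return results


partial_sums = run_pipeline()
print(partial_sums)       # one sum per batch of up to three tasks
print(sum(partial_sums))  # 190, the sum of 0..19
```

Whatever the batch size, the partial sums add up to the sum of all tasks, which is a convenient check when reading the run logs.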

Test

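  1. The demo runs locally to simulate a distributed environment. To run it across different machines, change the loopback IP address in client.py to the IP of the server/master machine, and make server.py bind to an address that the other machines can reach instead of 127.0.0.1.
  2. Run server.py first and then client.py, client_2.py, client_3.py…; alternatively, start client.py first, since it keeps retrying the connection until server.py is up.
    All clients run the same code.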
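Instead of editing the IP address inside client.py for every deployment, the address can be taken from the command line. This is a minimal sketch, assuming hypothetical --host and --port options whose defaults match the values hard-coded in the demo; the address 192.168.1.10 is only a placeholder.

```python
import argparse


def parse_endpoint(argv=None):
    """Return (host, port) for the manager; defaults match the demo's values."""
    parser = argparse.ArgumentParser(description="Worker for the queue demo")
    parser.add_argument("--host", default="127.0.0.1",
                        help="IP of the server/master machine")
    parser.add_argument("--port", type=int, default=5000,
                        help="port the manager listens on")
    args = parser.parse_args(argv)
    return args.host, args.port


# Local simulation: no arguments, so the loopback address is used.
print(parse_endpoint([]))
# Another machine: pass the master's address (placeholder value).
print(parse_endpoint(["--host", "192.168.1.10"]))
```

In client.py the returned pair would replace the hard-coded host and port before start_worker is called.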

1. Master / Server

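# server.py
# -*- coding:utf-8 -*-

# Multi-process distributed demo
# Server side
# How the master works: the managers module exposes the Queues over the network, so processes on other machines can access them.
# The server process creates the Queues, registers them on the network, and then writes tasks into the task Queue: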

import random, queue
from multiprocessing.managers import BaseManager
import numpy as np
import time
from jc import utils

# Initialise the custom logger
mlog = utils.my_logger("Server")

# Queue used to send tasks
task_queue = queue.Queue()
# Queue used to receive results
result_queue = queue.Queue()


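# Use a regular function instead of a lambda, because pickle cannot serialize lambdas in Python 2.7
def get_task_queue():
    return task_queue


# Use a regular function instead of a lambda, because pickle cannot serialize lambdas in Python 2.7
# This accessor must return result_queue; returning task_queue here would feed
# the workers' sums back to them as new tasks.
def get_result_queue():
    return result_queue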


def startManager(host, port, authkey):
    # Register both Queues on the network; the callable argument ties each name to its Queue object.
    # Note that the callable is passed without parentheses.
    BaseManager.register('get_task_queue', callable=get_task_queue)
    BaseManager.register('get_result_queue', callable=get_result_queue)
    # Set the host, bind the port, and set the authentication key to authkey
    manager = BaseManager(address=(host, port), authkey=authkey)
    # Start the manager server
    manager.start()
    return manager


def put_queue(manager, objs):
    # Access the queue over the network
    task = manager.get_task_queue()
    for obj in objs:
        try:
            #print("Put obj:{}".format(obj))
            mlog.info("Put obj:{}".format(obj))
            task.put(obj)
            time.sleep(1)
        except queue.Full:
            mlog.info("put_queue task full.exit ")
            break

def get_result(worker):
    # Access the queue over the network
    result = worker.get_result_queue()
    while 1:
        try:
            n = result.get(timeout=10)
            mlog.info("Result get {}".format(n))
            time.sleep(1)
        except queue.Empty:
            mlog.info("get_result result empty...retring")
            continue
        else:
            pass


if __name__ == "__main__":

    host = '127.0.0.1'
    port = 5000
    authkey = b'abc'
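    # Start the manager server
    manager = startManager(host, port, authkey)

    # Data: the integers 0..19
    data = np.arange(0,20)

    # Add the data to the task queue
    put_queue(manager, data)
    #get_queue(manager)

    get_result(manager)

    # Shut down the server (get_result() loops forever, so this is reached only if that loop exits)
    manager.shutdown()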
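To see the register/connect round trip in one short script, the server can also be run in a background thread of the same process. This is a minimal sketch, not the demo's own setup: it uses get_server() and serve_forever() instead of manager.start(), asks the OS for a free port by passing port 0, and the names ServerManager, ClientManager and round_trip are hypothetical.

```python
import queue
import threading
from multiprocessing.managers import BaseManager

task_queue = queue.Queue()
result_queue = queue.Queue()


def get_task_queue():
    return task_queue


def get_result_queue():
    return result_queue


class ServerManager(BaseManager):
    """Serving side: each name is bound to a callable returning a Queue."""


class ClientManager(BaseManager):
    """Connecting side: only the names are registered."""


ServerManager.register("get_task_queue", callable=get_task_queue)
ServerManager.register("get_result_queue", callable=get_result_queue)
ClientManager.register("get_task_queue")
ClientManager.register("get_result_queue")


def round_trip(values, authkey=b"abc"):
    """Send values through the networked task queue and return their sum."""
    # Port 0 lets the OS choose a free port; server.address holds the real one.
    manager = ServerManager(address=("127.0.0.1", 0), authkey=authkey)
    server = manager.get_server()
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # The client side only knows the address and the authkey.
    client = ClientManager(address=server.address, authkey=authkey)
    client.connect()
    tasks = client.get_task_queue()      # proxy to the server's task_queue
    results = client.get_result_queue()  # proxy to the server's result_queue

    for value in values:
        tasks.put(value)
    total = sum(tasks.get(timeout=5) for _ in values)
    results.put(total)

    # Read the result directly from the server-side object.
    return result_queue.get(timeout=5)


print(round_trip([0, 1, 2, 3, 4]))
```

Because the two accessors return different Queue objects, the value written through the result proxy arrives in result_queue and the task queue is left empty.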

2. Slave / Client

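# client.py
# -*- coding:utf-8 -*-

# In a distributed multi-process environment, tasks must not be added by operating on the original task_queue directly,
# because that bypasses the manager's wrapping; they must go through the Queue interface returned by manager.get_task_queue().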

import time, queue
from multiprocessing.managers import BaseManager
from jc import utils

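# Initialise the custom logger (the name "Server" is reused here, which is why the client logs are also prefixed with "Server")
mlog = utils.my_logger("Server")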

cal_queue = queue.Queue(3)

def start_worker(host, port, authkey):
	# This BaseManager only fetches the queues from the network, so only the names are registered
	BaseManager.register('get_task_queue')
	BaseManager.register('get_result_queue')
	mlog.info ('Connect to server %s' % host)
	# Note: port and authkey must match the manager server's settings exactly
	worker = BaseManager(address=(host, port), authkey=authkey)
	# Connect to the manager server
	try:
		worker.connect()
	except Exception as e:
		mlog.exception(e)
		mlog.info("Trying to reconnect...")
		time.sleep(1)
		# Return the retried connection; without the return the caller would receive None
		return start_worker(host, port, authkey)
	else:
		mlog.info('Connecting server %s' % host)
		return worker

def get_queue(worker):
	if not worker:
		mlog.info("worker is None, exit")
		return

	task = worker.get_task_queue()
	result = worker.get_result_queue()
	# Take data from the task queue, sum it in batches, and put each sum into the result queue

	tag = 0
	while 1:
		tag = tag + 1
		if cal_queue.full() or (tag>3 and not cal_queue.empty()):
			cal_sum = 0
			while not cal_queue.empty():
				cal_sum += cal_queue.get()
			result.put(cal_sum)
			mlog.info('result put %d' % cal_sum)
			tag = 0
		try:
			n = task.get(timeout=10)
			mlog.info('worker get %d' % n)
			cal_queue.put(n)
			time.sleep(1)
		except queue.Empty:
			mlog.info("get_queue task empty...retring")
			time.sleep(1)
			continue
		except queue.Full:
			mlog.info("put_cal_queue task full...waiting")
			time.sleep(1)
			continue



if __name__ == "__main__":
	host = '127.0.0.1'
	port = 5000
	authkey = b'abc'

	# Start the worker
	worker = start_worker(host, port, authkey)
	# Process the queues
	get_queue(worker)
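The reconnection logic in start_worker retries by calling itself, which grows the call stack for as long as the server is down. A loop with a bounded number of attempts is one alternative. This is a minimal sketch, assuming that a failed connection raises OSError (ConnectionRefusedError is a subclass); the helper name connect_with_retry and its parameters are hypothetical.

```python
import time


def connect_with_retry(connect, attempts=5, delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError as exc:  # e.g. ConnectionRefusedError while the server is down
            last_error = exc
            if attempt < attempts:
                sleep(delay)  # wait before the next attempt
    raise last_error


# Usage with a stand-in for worker.connect that fails twice, then succeeds.
calls = {"count": 0}


def flaky_connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionRefusedError("server not up yet")
    return "connected"


print(connect_with_retry(flaky_connect, delay=0))
```

In client.py, connect would be a small function that calls worker.connect() and returns worker.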

Run log

Master 1x + Slave 2x
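At first I did not understand why, in the logs of client 1 and client 2, values seem to be fetched from the queue repeatedly while the results the master receives are still correct. The run that produced these logs used a get_result_queue() that returned task_queue, so every sum a worker put as a "result" went back into the task queue and was later fetched as a task, either by a worker or by the master. For example, the 3 that client 1 puts at 10:07:13 is fetched again by client 1 at 10:07:15. Because each sum replaces the values it was computed from, the total is preserved: the two values the master finally receives, 97 and 93, add up to 190, the sum of 0..19.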

  1. server:
/Users/gdlocal1/PycharmProjects/CloudSystem/venv/bin/python /Users/gdlocal1/PycharmProjects/CloudSystem/CloudSystem/test.py
2019-07-05 10:07:06,379 - Server:put_queue - INFO - Put obj:0 
2019-07-05 10:07:07,384 - Server:put_queue - INFO - Put obj:1 
2019-07-05 10:07:08,387 - Server:put_queue - INFO - Put obj:2 
2019-07-05 10:07:09,387 - Server:put_queue - INFO - Put obj:3 
2019-07-05 10:07:10,388 - Server:put_queue - INFO - Put obj:4 
2019-07-05 10:07:11,390 - Server:put_queue - INFO - Put obj:5 
2019-07-05 10:07:12,396 - Server:put_queue - INFO - Put obj:6 
2019-07-05 10:07:13,400 - Server:put_queue - INFO - Put obj:7 
2019-07-05 10:07:14,403 - Server:put_queue - INFO - Put obj:8 
2019-07-05 10:07:15,405 - Server:put_queue - INFO - Put obj:9 
2019-07-05 10:07:16,409 - Server:put_queue - INFO - Put obj:10 
2019-07-05 10:07:17,412 - Server:put_queue - INFO - Put obj:11 
2019-07-05 10:07:18,416 - Server:put_queue - INFO - Put obj:12 
2019-07-05 10:07:19,420 - Server:put_queue - INFO - Put obj:13 
2019-07-05 10:07:20,423 - Server:put_queue - INFO - Put obj:14 
2019-07-05 10:07:21,428 - Server:put_queue - INFO - Put obj:15 
2019-07-05 10:07:22,433 - Server:put_queue - INFO - Put obj:16 
2019-07-05 10:07:23,437 - Server:put_queue - INFO - Put obj:17 
2019-07-05 10:07:24,440 - Server:put_queue - INFO - Put obj:18 
2019-07-05 10:07:25,444 - Server:put_queue - INFO - Put obj:19 
2019-07-05 10:07:36,458 - Server:get_result - INFO - get_result result empty...retring 
2019-07-05 10:07:46,462 - Server:get_result - INFO - get_result result empty...retring 
2019-07-05 10:07:56,466 - Server:get_result - INFO - get_result result empty...retring 
2019-07-05 10:07:59,685 - Server:get_result - INFO - Result get 97 
2019-07-05 10:08:10,694 - Server:get_result - INFO - get_result result empty...retring 
2019-07-05 10:08:20,698 - Server:get_result - INFO - get_result result empty...retring 
2019-07-05 10:08:30,702 - Server:get_result - INFO - get_result result empty...retring 
2019-07-05 10:08:34,503 - Server:get_result - INFO - Result get 93 
2019-07-05 10:08:45,511 - Server:get_result - INFO - get_result result empty...retring 
2019-07-05 10:08:55,514 - Server:get_result - INFO - get_result result empty...retring 
  2. client 1:
/Users/gdlocal1/PycharmProjects/CloudSystem/venv/bin/python /Users/gdlocal1/PycharmProjects/CloudSystem/CloudSystem/test_2.py
2019-07-05 10:07:10,559 - Server:start_worker - INFO - Connect to server 127.0.0.1 
2019-07-05 10:07:10,561 - Server:start_worker - INFO - Connecting server 127.0.0.1 
2019-07-05 10:07:10,648 - Server:get_queue - INFO - worker get 0 
2019-07-05 10:07:11,650 - Server:get_queue - INFO - worker get 1 
2019-07-05 10:07:12,655 - Server:get_queue - INFO - worker get 2 
2019-07-05 10:07:13,656 - Server:get_queue - INFO - result put 3 
2019-07-05 10:07:13,656 - Server:get_queue - INFO - worker get 4 
2019-07-05 10:07:14,660 - Server:get_queue - INFO - worker get 6 
2019-07-05 10:07:15,665 - Server:get_queue - INFO - worker get 3 
2019-07-05 10:07:16,670 - Server:get_queue - INFO - result put 13 
2019-07-05 10:07:16,671 - Server:get_queue - INFO - worker get 9 
2019-07-05 10:07:17,675 - Server:get_queue - INFO - worker get 15 
2019-07-05 10:07:18,680 - Server:get_queue - INFO - worker get 11 
2019-07-05 10:07:19,685 - Server:get_queue - INFO - result put 35 
2019-07-05 10:07:19,685 - Server:get_queue - INFO - worker get 13 
2019-07-05 10:07:20,689 - Server:get_queue - INFO - worker get 35 
2019-07-05 10:07:21,694 - Server:get_queue - INFO - worker get 15 
2019-07-05 10:07:22,698 - Server:get_queue - INFO - result put 63 
2019-07-05 10:07:22,698 - Server:get_queue - INFO - worker get 57 
2019-07-05 10:07:23,700 - Server:get_queue - INFO - worker get 17 
2019-07-05 10:07:25,445 - Server:get_queue - INFO - worker get 19 
2019-07-05 10:07:26,447 - Server:get_queue - INFO - result put 93 
2019-07-05 10:07:26,448 - Server:get_queue - INFO - worker get 93 
2019-07-05 10:07:37,454 - Server:get_queue - INFO - get_queue task empty...retring 
2019-07-05 10:07:48,461 - Server:get_queue - INFO - get_queue task empty...retring 
2019-07-05 10:07:59,469 - Server:get_queue - INFO - get_queue task empty...retring 
2019-07-05 10:08:00,475 - Server:get_queue - INFO - result put 93 
2019-07-05 10:08:10,480 - Server:get_queue - INFO - get_queue task empty...retring 
2019-07-05 10:08:21,488 - Server:get_queue - INFO - get_queue task empty...retring 
  3. client 2:
/Users/gdlocal1/PycharmProjects/CloudSystem/venv/bin/python /Users/gdlocal1/PycharmProjects/CloudSystem/CloudSystem/test_3.py
2019-07-05 10:07:13,523 - Server:start_worker - INFO - Connect to server 127.0.0.1 
2019-07-05 10:07:13,524 - Server:start_worker - INFO - Connecting server 127.0.0.1 
2019-07-05 10:07:13,619 - Server:get_queue - INFO - worker get 3 
2019-07-05 10:07:14,620 - Server:get_queue - INFO - worker get 5 
2019-07-05 10:07:15,622 - Server:get_queue - INFO - worker get 7 
2019-07-05 10:07:16,626 - Server:get_queue - INFO - result put 15 
2019-07-05 10:07:16,626 - Server:get_queue - INFO - worker get 8 
2019-07-05 10:07:17,631 - Server:get_queue - INFO - worker get 10 
2019-07-05 10:07:18,635 - Server:get_queue - INFO - worker get 13 
2019-07-05 10:07:19,640 - Server:get_queue - INFO - result put 31 
2019-07-05 10:07:19,640 - Server:get_queue - INFO - worker get 12 
2019-07-05 10:07:20,643 - Server:get_queue - INFO - worker get 31 
2019-07-05 10:07:21,647 - Server:get_queue - INFO - worker get 14 
2019-07-05 10:07:22,652 - Server:get_queue - INFO - result put 57 
2019-07-05 10:07:22,652 - Server:get_queue - INFO - worker get 16 
2019-07-05 10:07:23,657 - Server:get_queue - INFO - worker get 63 
2019-07-05 10:07:24,658 - Server:get_queue - INFO - worker get 18 
2019-07-05 10:07:25,662 - Server:get_queue - INFO - result put 97 
2019-07-05 10:07:25,663 - Server:get_queue - INFO - worker get 97 
2019-07-05 10:07:36,665 - Server:get_queue - INFO - get_queue task empty...retring 
2019-07-05 10:07:47,672 - Server:get_queue - INFO - get_queue task empty...retring 
2019-07-05 10:07:58,680 - Server:get_queue - INFO - get_queue task empty...retring 
2019-07-05 10:07:59,685 - Server:get_queue - INFO - result put 97 
2019-07-05 10:08:00,475 - Server:get_queue - INFO - worker get 93 
2019-07-05 10:08:11,485 - Server:get_queue - INFO - get_queue task empty...retring 
2019-07-05 10:08:22,492 - Server:get_queue - INFO - get_queue task empty...retring 
2019-07-05 10:08:33,500 - Server:get_queue - INFO - get_queue task empty...retring 
2019-07-05 10:08:34,503 - Server:get_queue - INFO - result put 93 
2019-07-05 10:08:44,508 - Server:get_queue - INFO - get_queue task empty...retring 
2019-07-05 10:08:55,514 - Server:get_queue - INFO - get_queue task empty...retring 
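The logs above are consistent with the workers' sums being written back into the same queue the tasks come from, which is what happens when both accessors hand out one queue object. The effect can be reproduced locally. This is a minimal sketch, assuming a single worker and a collections.deque in place of the networked queue; the function name collapse_shared_queue is hypothetical.

```python
from collections import deque


def collapse_shared_queue(values, batch_size=3):
    """Replace up to batch_size items with their sum until one value is left.

    values must be non-empty. Returns (final_value, sums_in_order).
    """
    shared = deque(values)
    history = []
    while len(shared) > 1:
        count = min(batch_size, len(shared))
        batch = [shared.popleft() for _ in range(count)]
        total = sum(batch)
        shared.append(total)   # the "result" lands back in the task queue
        history.append(total)
    return shared[0], history


final, sums = collapse_shared_queue(range(20))
print(sums)   # intermediate sums, including sums of earlier sums
print(final)  # 190: the total survives every round of feedback
```

With two workers and the master all reading from the one queue, the run ends when the master happens to take the remaining values, which is why the server log shows 97 and 93 instead of a single 190.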

GitLab

The code is maintained on GitLab:

https://gitlab.com/cyril_j/mutils/tree/master/Python/Distributed_Computer_demo

References

  1. Communication between multiple servers in socket-based distributed computing with Python
  2. Distributed Processes - Liao Xuefeng