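Baidu Baike describes MapReduce as follows:
MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). Its central ideas, "Map" and "Reduce", are borrowed from functional programming languages, along with features taken from vector programming languages. It makes it much easier for programmers with no experience in distributed parallel programming to run their programs on a distributed system. In current implementations, the user specifies a Map function, which maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function, which ensures that all mapped key-value pairs sharing the same key are grouped together.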
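From this description, it is clear that MapReduce is a computing model, framework, and platform for parallel processing of big data. Traditionally, MapReduce work is done on distributed frameworks such as Hadoop; as cloud computing has matured, cloud vendors have also launched online MapReduce services one after another.
In this article we use the MapReduce model to implement a simple WordCount algorithm. Unlike the traditional approach built on big data frameworks such as Hadoop, we combine object storage with cloud functions.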
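As a point of reference before moving to cloud functions, the Map, group-by-key, and Reduce flow described above can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not the cloud implementation built in this article; the function names `map_words`, `shuffle`, and `reduce_counts` are made up for this example.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, which happens between the two steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_counts(groups):
    """Reduce step: merge all values that share the same key into one total."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for line in lines for pair in map_words(line)]
counts = reduce_counts(shuffle(pairs))
print(counts)
```

In the rest of the article, the two steps run as separate cloud functions and the grouping is replaced by files passed through a bucket.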
Theoretical Basis
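Before starting, we lay out a simple flow based on what MapReduce requires.
In this structure we need two cloud functions, one acting as the Mapper and one as the Reducer, and three object storage buckets: an input bucket, an intermediate cache bucket, and a result bucket. Before getting hands-on, we create three buckets in the Guangzhou region: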
Name              Region        Bucket name
Object storage 1  ap-guangzhou  srcmr
Object storage 2  ap-guangzhou  middlestagebucket
Object storage 3  ap-guangzhou  destmr
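In code and configuration these buckets are referred to by their full name, which is the short name followed by the account's APPID, and the trigger configuration later in this article uses the bucket's host name. A small helper makes that naming explicit. This is only a sketch: the function names are hypothetical, and the APPID 1256773370 is simply the value that appears in this article's deployment configuration, so substitute your own.

```python
def full_bucket_name(name, appid):
    """COS addresses a bucket as <name>-<appid>."""
    return "{}-{}".format(name, appid)


def bucket_host(name, appid, region):
    """Host name of the bucket, in the form used by the trigger configuration."""
    return "{}.cos.{}.myqcloud.com".format(full_bucket_name(name, appid), region)


REGION = "ap-guangzhou"
APPID = "1256773370"  # example value taken from the article; replace with your own

# "destmr" is the result bucket name as it appears in the Reducer code.
for short_name in ("srcmr", "middlestagebucket", "destmr"):
    print(full_bucket_name(short_name, APPID), bucket_host(short_name, APPID, REGION))
```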
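To make the Mapper and Reducer logic clearer, we first adapt the traditional WordCount structure so that it suits cloud functions better, and divide the work sensibly between the Mapper and the Reducer.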
Implementation
Write the Mapper logic:
# -*- coding: utf8 -*-
import datetime
import logging
import os
import re
import sys

from qcloud_cos_v5 import CosConfig
from qcloud_cos_v5 import CosS3Client
from qcloud_cos_v5 import CosServiceError

logging.basicConfig(level=logging.INFO, stream=sys.stdout)
logger = logging.getLogger()
logger.setLevel(level=logging.INFO)

region = u'ap-guangzhou'  # change to your own region
middle_stage_bucket = 'middlestagebucket'  # change to your own bucket name

def delete_file_folder(src):
    # Remove a file, or a directory and everything under it; ignore OS errors.
    if os.path.isfile(src):
        try:
            os.remove(src)
        except OSError:
            pass
    elif os.path.isdir(src):
        for item in os.listdir(src):
            itemsrc = os.path.join(src, item)
            delete_file_folder(itemsrc)
        try:
            os.rmdir(src)
        except OSError:
            pass


def download_file(cos_client, bucket, key, download_path):
    # Download one COS object to a local path; return 0 on success, -1 on failure.
    logger.info("Get from [%s] to download file [%s]" % (bucket, key))
    try:
        response = cos_client.get_object(Bucket=bucket, Key=key)
        response['Body'].get_stream_to_file(download_path)
    except CosServiceError as e:
        logger.error(e.get_error_code())
        logger.error(e.get_error_msg())
        return -1
    return 0


def upload_file(cos_client, bucket, key, local_file_path):
    # Upload a local file to COS; return 0 on success, -1 on failure.
    logger.info("Start to upload file to cos")
    try:
        cos_client.put_object_from_local_file(
            Bucket=bucket,
            LocalFilePath=local_file_path,
            Key=key)
    except CosServiceError as e:
        logger.error(e.get_error_code())
        logger.error(e.get_error_msg())
        return -1
    logger.info("Upload data map file [%s] Success" % key)
    return 0

def do_mapping(cos_client, bucket, key, middle_stage_bucket, middle_file_key):
    src_file_path = u'/tmp/' + key.split('/')[-1]
    middle_file_path = u'/tmp/' + u'mapped_' + key.split('/')[-1]
    download_ret = download_file(cos_client, bucket, key, src_file_path)  # download the source file
    if download_ret != 0:
        return -1
    # Read the local copy in /tmp and write one "word<TAB>1" line per word.
    with open(src_file_path, 'r') as inputfile, open(middle_file_path, 'w') as mapfile:
        for line in inputfile:
            line = re.sub(r'[^a-zA-Z0-9]', ' ', line)  # replace non-alphanumeric characters with spaces
            for word in line.split():
                mapfile.write('%s\t%s\n' % (word, 1))  # emit a count of 1 per occurrence
    upload_ret = upload_file(cos_client, middle_stage_bucket, middle_file_key,
                             middle_file_path)  # upload the per-word records
    delete_file_folder(src_file_path)
    delete_file_folder(middle_file_path)
    return upload_ret

def map_caller(event, context, cos_client):
    cos_record = event['Records'][0]['cos']
    appid = cos_record['cosBucket']['appid']
    bucket_name = cos_record['cosBucket']['name']
    bucket = bucket_name + '-' + appid
    key = cos_record['cosObject']['key']
    # Strip the leading /<appid>/<bucket name>/ prefix from the event key.
    key = key.replace('/' + str(appid) + '/' + bucket_name + '/', '', 1)
    logger.info("Key is " + key)
    middle_bucket = middle_stage_bucket + '-' + appid
    middle_file_key = '/' + 'middle_' + key.split('/')[-1]
    return do_mapping(cos_client, bucket, key, middle_bucket, middle_file_key)


def main_handler(event, context):
    logger.info("start main handler")
    if "Records" not in event:
        return {"errorMsg": "event does not come from cos"}
    secret_id = ""  # fill in your own SecretId
    secret_key = ""  # fill in your own SecretKey
    config = CosConfig(Region=region, SecretId=secret_id, SecretKey=secret_key)
    cos_client = CosS3Client(config)
    start_time = datetime.datetime.now()
    res = map_caller(event, context, cos_client)
    end_time = datetime.datetime.now()
    # total_seconds() covers the whole interval; .microseconds holds only the sub-second part.
    print("data mapping duration: " + str((end_time - start_time).total_seconds() * 1000) + "ms")
    if res == 0:
        return "Data mapping SUCCESS"
    else:
        return "Data mapping FAILED"
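To see what map_caller extracts without deploying anything, the same parsing can be run locally against a hand-written event. The event below is a trimmed, made-up sample containing only the fields the Mapper code reads (real COS trigger events carry more fields), and `parse_cos_event` is a hypothetical helper that mirrors map_caller rather than part of any SDK.

```python
def parse_cos_event(event, middle_stage_bucket="middlestagebucket"):
    """Return (source bucket, object key, intermediate bucket, intermediate key)."""
    cos = event["Records"][0]["cos"]
    appid = cos["cosBucket"]["appid"]
    name = cos["cosBucket"]["name"]
    # Strip the leading /<appid>/<bucket name>/ prefix, as map_caller does.
    key = cos["cosObject"]["key"].replace("/{}/{}/".format(appid, name), "", 1)
    middle_key = "/middle_" + key.split("/")[-1]
    return name + "-" + appid, key, middle_stage_bucket + "-" + appid, middle_key


# Hand-written sample event; the values are illustrative only.
sample_event = {
    "Records": [{
        "cos": {
            "cosBucket": {"name": "srcmr", "appid": "1256773370"},
            "cosObject": {"key": "/1256773370/srcmr/docs/test.txt"},
        }
    }]
}

parsed = parse_cos_event(sample_event)
print(parsed)
```

Only the last path segment of the key is kept for the intermediate object, so two inputs with the same file name in different folders would write to the same intermediate key.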
In the same way, create a reducer.py file and write the Reducer logic:
# -*- coding: utf8 -*-
import datetime
import logging
import os
import sys
from operator import itemgetter

from qcloud_cos_v5 import CosConfig
from qcloud_cos_v5 import CosS3Client
from qcloud_cos_v5 import CosServiceError

region = u'ap-guangzhou'  # change to your own region
result_bucket = u'destmr'  # change to your own bucket name

logging.basicConfig(level=logging.INFO, stream=sys.stdout)
logger = logging.getLogger()
logger.setLevel(level=logging.INFO)

def delete_file_folder(src):
    # Remove a file, or a directory and everything under it; ignore OS errors.
    if os.path.isfile(src):
        try:
            os.remove(src)
        except OSError:
            pass
    elif os.path.isdir(src):
        for item in os.listdir(src):
            itemsrc = os.path.join(src, item)
            delete_file_folder(itemsrc)
        try:
            os.rmdir(src)
        except OSError:
            pass


def download_file(cos_client, bucket, key, download_path):
    # Download one COS object to a local path; return 0 on success, -1 on failure.
    logger.info("Get from [%s] to download file [%s]" % (bucket, key))
    try:
        response = cos_client.get_object(Bucket=bucket, Key=key)
        response['Body'].get_stream_to_file(download_path)
    except CosServiceError as e:
        logger.error(e.get_error_code())
        logger.error(e.get_error_msg())
        return -1
    return 0


def upload_file(cos_client, bucket, key, local_file_path):
    # Upload a local file to COS; return 0 on success, -1 on failure.
    logger.info("Start to upload file to cos")
    try:
        cos_client.put_object_from_local_file(
            Bucket=bucket,
            LocalFilePath=local_file_path,
            Key=key)
    except CosServiceError as e:
        logger.error(e.get_error_code())
        logger.error(e.get_error_msg())
        return -1
    logger.info("Upload data map file [%s] Success" % key)
    return 0

def qcloud_reducer(cos_client, bucket, key, result_bucket, result_key):
    word2count = {}
    src_file_path = u'/tmp/' + key.split('/')[-1]
    result_file_path = u'/tmp/' + u'result_' + key.split('/')[-1]
    download_ret = download_file(cos_client, bucket, key, src_file_path)
    if download_ret != 0:
        return -1  # nothing to reduce if the intermediate file could not be fetched
    with open(src_file_path, 'r') as map_file:
        for line in map_file:
            line = line.strip()
            try:
                word, count = line.split('\t', 1)
                word2count[word] = word2count.get(word, 0) + int(count)
            except ValueError as e:
                # Malformed line: no tab separator or a non-integer count.
                logger.error("error value: %s, current line: %s" % (e, line))
                continue
    delete_file_folder(src_file_path)
    # Sort by count, highest first.
    sorted_word2count = sorted(word2count.items(), key=itemgetter(1), reverse=True)
    with open(result_file_path, 'w') as result_file:
        for word, count in sorted_word2count:
            result_file.write('%s\t%s\n' % (word, count))
    upload_ret = upload_file(cos_client, result_bucket, result_key, result_file_path)
    delete_file_folder(result_file_path)
    return upload_ret

def reduce_caller(event, context, cos_client):
    cos_record = event['Records'][0]['cos']
    appid = cos_record['cosBucket']['appid']
    bucket_name = cos_record['cosBucket']['name']
    bucket = bucket_name + '-' + appid
    key = cos_record['cosObject']['key']
    # Strip the leading /<appid>/<bucket name>/ prefix from the event key.
    key = key.replace('/' + str(appid) + '/' + bucket_name + '/', '', 1)
    logger.info("Key is " + key)
    res_bucket = result_bucket + '-' + appid
    result_key = '/' + 'result_' + key.split('/')[-1]
    return qcloud_reducer(cos_client, bucket, key, res_bucket, result_key)


def main_handler(event, context):
    logger.info("start main handler")
    if "Records" not in event:
        return {"errorMsg": "event does not come from cos"}
    secret_id = "SecretId"  # replace with your own SecretId
    secret_key = "SecretKey"  # replace with your own SecretKey
    config = CosConfig(Region=region, SecretId=secret_id, SecretKey=secret_key)
    cos_client = CosS3Client(config)
    start_time = datetime.datetime.now()
    res = reduce_caller(event, context, cos_client)
    end_time = datetime.datetime.now()
    # total_seconds() covers the whole interval; .microseconds holds only the sub-second part.
    print("data reducing duration: " + str((end_time - start_time).total_seconds() * 1000) + "ms")
    if res == 0:
        return "Data reducing SUCCESS"
    else:
        return "Data reducing FAILED"
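The text-processing core of the two functions can be exercised locally, with in-memory strings standing in for the COS objects. The sketch below follows the same intermediate format (one word, a tab, and the count 1 per line) and sorts by descending count, but `map_text` and `reduce_text` are hypothetical names introduced here, and no COS calls are made.

```python
import re
from operator import itemgetter


def map_text(text):
    """Mapper core: turn raw text into 'word<TAB>1' lines."""
    out = []
    for line in text.splitlines():
        line = re.sub(r"[^a-zA-Z0-9]", " ", line)  # keep letters and digits only
        out.extend("%s\t%s" % (word, 1) for word in line.split())
    return "\n".join(out)


def reduce_text(mapped):
    """Reducer core: sum the counts per word and sort by count, highest first."""
    word2count = {}
    for line in mapped.splitlines():
        try:
            word, count = line.strip().split("\t", 1)
            word2count[word] = word2count.get(word, 0) + int(count)
        except ValueError:
            continue  # skip malformed lines
    return sorted(word2count.items(), key=itemgetter(1), reverse=True)


mapped = map_text("Hello, world! Hello serverless.\nhello again")
result = reduce_text(mapped)
print(result)
```

Like the functions above, this sketch is case-sensitive, so `Hello` and `hello` are counted separately; lowercasing each word in the mapper would merge them.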
Deployment and Testing
Following the Serverless Framework YAML specification, write serverless.yaml:
WordCountMapper:
  component: "@serverless/tencent-scf"
  inputs:
    name: mapper
    codeUri: ./code
    handler: mapper.main_handler  # <file>.<function>; assumes the Mapper code is saved as code/mapper.py
    runtime: Python3.6
    region: ap-guangzhou
    description: WordCount mapper
    memorySize: 64
    timeout: 20
    events:
      - cos:
          name: srcmr-1256773370.cos.ap-guangzhou.myqcloud.com
          parameters:
            bucket: srcmr-1256773370.cos.ap-guangzhou.myqcloud.com
            filter:
              prefix: ''
              suffix: ''
            events: cos:ObjectCreated:*
            enable: true
WordCountReducer:
  component: "@serverless/tencent-scf"
  inputs:
    name: reducer
    codeUri: ./code
    handler: reducer.main_handler  # <file>.<function>; assumes reducer.py sits in ./code
    runtime: Python3.6
    region: ap-guangzhou
    description: WordCount reducer
    memorySize: 64
    timeout: 20
    events:
      - cos:
          name: middlestagebucket-1256773370.cos.ap-guangzhou.myqcloud.com
          parameters:
            bucket: middlestagebucket-1256773370.cos.ap-guangzhou.myqcloud.com
            filter:
              prefix: ''
              suffix: ''
            events: cos:ObjectCreated:*
            enable: true
Once that is done, deploy with the sls --debug command. After a successful deployment, run a basic test:
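- First, prepare a document in English;
- Log in to the Tencent Cloud console, open the srcmr bucket created earlier, and upload the file;
- After the upload succeeds, wait a moment and the logs will show that the Reducer has run after the Mapper.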
Then open the result bucket to check the result.
With that, we have a simple word-frequency counter.
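Given the key naming in the two functions, an input object such as test.txt should end up in the result bucket as result_middle_test.txt, with one word, a tab, and its count per line, sorted by descending count. Once that file has been downloaded, the most frequent words can be picked out as sketched below. The sample content is made up, and `result_key_for` and `top_words` are hypothetical helpers, not part of the deployed functions.

```python
def result_key_for(input_key):
    """Object key the Reducer writes for a given input object key."""
    basename = input_key.split("/")[-1]
    # Mapper prepends "middle_", Reducer prepends "result_".
    return "/result_middle_" + basename


def top_words(result_text, n=5):
    """Parse 'word<TAB>count' lines and return the first n (word, count) pairs."""
    pairs = []
    for line in result_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        word, count = line.split("\t", 1)
        pairs.append((word, int(count)))
    return pairs[:n]


# Made-up result content, already sorted by descending count.
sample_result = "the\t12\nof\t7\nserverless\t3\n"
print(result_key_for("docs/test.txt"))
print(top_words(sample_result, n=2))
```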
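Summary
Serverless architecture is comparatively well suited to big data processing. Tencent Cloud's official description of Serverless use cases includes data ETL processing: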
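Some data processing systems need to process huge volumes of data periodically or on a schedule. For example, a securities firm tallies trading activity every 12 hours and compiles the top 5 by trading volume for that period, or a flash-sale site processes its transaction logs once a day to collect the errors caused by sold-out items and analyze product popularity and trends. The nearly unlimited scaling of cloud functions lets you handle large-volume computation with ease. With cloud functions, you can run multiple mapper and reducer functions concurrently against the source data and finish the work in a short time; compared with traditional approaches, cloud functions also avoid idle resources and so save money.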
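We hope this example gives readers more ideas about where Serverless architecture applies: it performs well in monitoring and alerting, and it holds its own in big data too. In real production, no project relies on a single function working alone; multiple functions are combined into a Service, which is why deploying several functions in one step matters so much.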