Overview
- Review of functional programming in Python
- Python's map and reduce functions
- Writing parallel code with map
- The Map-Reduce programming model
- Writing Spark programs in Python
Reading
- Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National Laboratory.
- Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107–113.
- Chapters 1 and 3 of Karau, H., Wendell, P., & Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly.
Functional programming
Consider the following code:
def double_everything_in(data):
    result = []
    for i in data:
        result.append(2 * i)
    return result

def quadruple_everything_in(data):
    result = []
    for i in data:
        result.append(4 * i)
    return result

double_everything_in([1, 2, 3, 4, 5])
[2, 4, 6, 8, 10]
quadruple_everything_in([1, 2, 3, 4, 5])
[4, 8, 12, 16, 20]
- The code above does not follow the "Don't Repeat Yourself" (DRY) principle of software engineering.
- How can we avoid the repetition?
def multiply_by_x_everything_in(x, data):
    result = []
    for i in data:
        result.append(x * i)
    return result

multiply_by_x_everything_in(2, [1, 2, 3, 4, 5])
[2, 4, 6, 8, 10]
multiply_by_x_everything_in(4, [1, 2, 3, 4, 5])
[4, 8, 12, 16, 20]
- Now consider the following code:
def squared(x):
    return x * x

def double(x):
    return x * 2

def square_everything_in(data):
    result = []
    for i in data:
        result.append(squared(i))
    return result

def double_everything_in(data):
    result = []
    for i in data:
        result.append(double(i))
    return result

square_everything_in([1, 2, 3, 4, 5])
[1, 4, 9, 16, 25]
double_everything_in([1, 2, 3, 4, 5])
[2, 4, 6, 8, 10]
- How can we avoid the repetition?
def apply_f_to_everything_in(f, data):
    result = []
    for x in data:
        result.append(f(x))
    return result

apply_f_to_everything_in(squared, [1, 2, 3, 4, 5])
[1, 4, 9, 16, 25]
apply_f_to_everything_in(double, [1, 2, 3, 4, 5])
[2, 4, 6, 8, 10]
- Lambda expressions: when we want to map a function but would otherwise have to define a new named function every time, we can use an anonymous function instead.
apply_f_to_everything_in(lambda x: x*x, [1, 2, 3, 4, 5])
Python's map function
Python has a built-in map function that is much faster than the one we wrote ourselves.
map(lambda x: x*x, [1, 2, 3, 4, 5])
[1, 4, 9, 16, 25]
Implementing reduce
- The reduce function is an example of a fold.
- There are several ways to implement a fold; a right-fold sketch is given after the examples below.
- The following implementation is called a left fold.
def foldl(f, data, z):
    if (len(data) == 0):
        print z
        return z
    else:
        head = data[0]
        tail = data[1:]
        print "Folding", head, "with", tail, "using", z
        partial_result = f(z, data[0])
        print "Partial result is", partial_result
        return foldl(f, tail, partial_result)

def add(x, y):
    return x + y

foldl(add, [1, 2, 3, 4, 5], 0)
Folding 1 with [2, 3, 4, 5] using 0
Partial result is 1
Folding 2 with [3, 4, 5] using 1
Partial result is 3
Folding 3 with [4, 5] using 3
Partial result is 6
Folding 4 with [5] using 6
Partial result is 10
Folding 5 with [] using 10
Partial result is 15
15
- Using a lambda expression gives the same result.
foldl(lambda x, y: x + y, [1, 2, 3, 4, 5], 0)
Folding 1 with [2, 3, 4, 5] using 0
Partial result is 1
Folding 2 with [3, 4, 5] using 1
Partial result is 3
Folding 3 with [4, 5] using 3
Partial result is 6
Folding 4 with [5] using 6
Partial result is 10
Folding 5 with [] using 10
Partial result is 15
15
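For contrast, here is a minimal sketch of a right fold, which combines elements starting from the right-hand end of the list. This foldr helper is not part of the lecture's code, just an illustration of an alternative way to implement a fold.

def foldr(f, data, z):
    # Recurse to the end of the list first, then combine on the way back out
    if len(data) == 0:
        return z
    else:
        return f(data[0], foldr(f, data[1:], z))

foldr(lambda x, y: x - y, [1, 2, 3, 4, 5], 0)
# 1 - (2 - (3 - (4 - (5 - 0)))) = 3, whereas the left fold gives ((((0-1)-2)-3)-4)-5 = -15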
Python's reduce function
Python's built-in reduce function is a left fold. (In Python 3 it lives in the functools module.)
reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])
15
reduce(lambda x, y: x - y, [1, 2, 3, 4, 5], 0)
-15
Functional programming and parallelism
- Functional programming lends itself naturally to parallel programming.
- The map function can easily be parallelised through data-level parallelism.
- Passing the function in as an argument, and avoiding side effects, makes this parallelisation straightforward.
def perform_computation(f, result, data, i):
    print "Computing the ", i, "th result..."
    # This could be scheduled on a different CPU
    result[i] = f(data[i])

def my_map(f, data):
    result = [None] * len(data)
    for i in range(len(data)):
        perform_computation(f, result, data, i)
    # Wait for other CPUs to finish, and then..
    return result

my_map(lambda x: x * x, [1, 2, 3, 4, 5])
Computing the 0 th result...
Computing the 1 th result...
Computing the 2 th result...
Computing the 3 th result...
Computing the 4 th result...
[1, 4, 9, 16, 25]
A multi-threaded map function
from threading import Thread

def schedule_computation_threaded(f, result, data, threads, i):
    # Each function evaluation is scheduled on a different core.
    def my_job():
        print "Processing data:", data[i], "... "
        result[i] = f(data[i])
        print "Finished job #", i
        print "Result was", result[i]
    threads[i] = Thread(target=my_job)

def my_map_multithreaded(f, data):
    n = len(data)
    result = [None] * n
    threads = [None] * n
    print "Scheduling jobs.. "
    for i in range(n):
        schedule_computation_threaded(f, result, data, threads, i)
    print "Starting jobs.. "
    for i in range(n):
        threads[i].start()
    print "Waiting for jobs to finish.. "
    for i in range(n):
        threads[i].join()
    print "All done."
    return result

my_map_multithreaded(lambda x: x*x, [1, 2, 3, 4, 5])
Scheduling jobs..
Starting jobs..
Processing data: 1 ...
Finished job # 0
Result was 1
Processing data: 2 ...
Finished job # 1
Result was 4
Processing data: 3 ...
Finished job # 2
Result was 9
Processing data: 4 ...
Finished job # 3
Result was 16
Processing data: 5 ...
Finished job # 4
Result was 25
Waiting for jobs to finish..
All done.
[1, 4, 9, 16, 25]
from numpy.random import uniform
from time import sleep

def a_function_which_takes_a_long_time(x):
    sleep(uniform(2, 10))  # Simulate some long computation
    return x*x

my_map_multithreaded(a_function_which_takes_a_long_time, [1, 2, 3, 4, 5])
Scheduling jobs..
Starting jobs..
Processing data: 1 ...
Processing data: 2 ...
Processing data: 3 ...
Processing data: 4 ...
Processing data: 5 ...
Waiting for jobs to finish..
Finished job # 4
Result was 25
Finished job # 0
Result was 1
Finished job # 3
Result was 16
Finished job # 2
Result was 9
Finished job # 1
Result was 4
All done.
[1, 4, 9, 16, 25]
Map Reduce
- Map-Reduce is a programming model for large-scale parallel processing.
- Large-scale means it can exploit large compute clusters to process big data sets.
- There are several implementations, including Hadoop and Spark.
- Map-Reduce can be used from any language: Hadoop is written in Java and Spark in Scala, but both have Python interfaces.
- Python (or Scala) is well suited to the Map-Reduce model, but we are not required to program functionally.
- The Map-Reduce implementation takes care of the low-level machinery, so we do not have to worry about it.
Typical steps in a Map Reduce Computation
- ETL (extract, transform, load) a data set
- Map operation: extract the information you care about from each record
- "Shuffle and Sort": task/node allocation
- Reduce operation: aggregate, summarise, filter, or transform
- Save the results (a minimal sketch of these steps follows below)
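The following is a minimal in-memory sketch of these steps on a toy data set. The variable names are illustrative only; a real framework distributes each step across a cluster.

records = ["to be or not to be"]                               # ETL: load a toy "document"
mapped = [(w, 1) for line in records for w in line.split()]    # Map: emit (word, 1) pairs
groups = {}                                                    # Shuffle and sort: group values by key
for k, v in mapped:
    groups.setdefault(k, []).append(v)
reduced = dict((k, sum(vs)) for k, vs in groups.items())       # Reduce: aggregate per key
print(reduced)                                                 # Save/inspect the result (key order may vary)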
Callbacks for Map Reduce
- The data set, and the intermediate state of the computation at every step, are represented as key-value pairs.
- map(k, v) → ⟨k′, v′⟩*
- reduce(k′, ⟨k′, v′⟩*) → ⟨k′, v″⟩*
- The * denotes a collection of values.
- These collections are not ordered; a concrete illustration of these signatures is given below.
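As a concrete illustration, the callbacks for the word-count example below might look like this in Python. The function names are hypothetical, not part of any framework API.

def map_fn(key, value):
    # key: a document URL, value: the contents of the document
    # emits a collection of (word, 1) pairs
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # key: a word, values: the collection of counts emitted for that word
    return [(key, sum(values))]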
Resilient Distributed Data
- In a Map-Reduce computation these collections are called Resilient Distributed Datasets (RDDs):
- The data are distributed across nodes.
- The failure of a single node does not lead to data loss.
- The data are typically stored in HBase or HDFS.
- The map and reduce functions can be parallelised across different keys and elements of the collection.
Word Count Example
- In this example the input is a collection of URLs, each record holding one document.
- Problem: how many times does each word occur across the data set?
Word Count: Map
- The input data are mapped to:
- Key: URL
- Value: contents of the document
- For a given URL, the map operation emits pairs with:
- Key: word
- Value: 1
- So our original data set will be transformed into:
⟨to,1⟩ ⟨be,1⟩ ⟨or,1⟩ ⟨not,1⟩ ⟨to,1⟩ ⟨be,1⟩
Word Count: Reduce
- The reduce operation groups the values by key, and then performs the reduce on each key.
- Map-Reduce frameworks fold (combine) partial results where possible to minimise data copying.
- Data in different partitions are reduced separately (see the sketch after the results below).
- The choice of operator is important: it must be associative and commutative.
- In this example the function is the + operator.
⟨be,2⟩ ⟨not,1⟩ ⟨or,1⟩ ⟨to,2⟩
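A minimal sketch of reducing two partitions independently and then merging the partial results, assuming the operator (+) is associative and commutative. The partition contents here are purely illustrative.

partition_1 = [("to", 1), ("be", 1), ("or", 1)]
partition_2 = [("not", 1), ("to", 1), ("be", 1)]

def combine(pairs):
    # Per-partition partial reduce (a "combiner")
    out = {}
    for k, v in pairs:
        out[k] = out.get(k, 0) + v
    return out

final = {}
for partial in [combine(partition_1), combine(partition_2)]:
    for k, v in partial.items():
        final[k] = final.get(k, 0) + v
print(final)   # counts: to=2, be=2, or=1, not=1 (key order may vary)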
MiniMapReduce
- To understand how the Map-Reduce programming model works, we will implement our own small map-reduce framework in Python.
- This is not, however, how Hadoop or Spark are actually implemented.
##########################################################
#
# MiniMapReduce
#
# A non-parallel, non-scalable Map-Reduce implementation
##########################################################
def groupByKey(data):
    result = dict()
    for key, value in data:
        if key in result:
            result[key].append(value)
        else:
            result[key] = [value]
    return result

def reduceByKey(f, data):
    key_values = groupByKey(data)
    return map(lambda key:
                   (key, reduce(f, key_values[key])),
               key_values)
Word-count using MiniMapReduce
data = map(lambda x: (x, 1), "to be or not to be".split())
data
[('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
groupByKey(data)
{'be': [1, 1], 'not': [1], 'or': [1], 'to': [1, 1]}
reduceByKey(lambda x, y: x + y, data)
[('not', 1), ('to', 2), ('or', 1), ('be', 2)]
Parallelising MiniMapReduce
We can easily turn the map-reduce implementation above into a parallel framework by using the my_map_multithreaded function we defined earlier.

def reduceByKey_multithreaded(f, data):
    key_values = groupByKey(data)
    return my_map_multithreaded(
        lambda key: (key, reduce(f, key_values[key])), key_values.keys())
reduceByKey_multithreaded(lambda x, y: x + y, data)
Scheduling jobs..
Starting jobs..
Processing data: not ...
Finished job # 0
Result was ('not', 1)
Processing data: to ...
Finished job # 1
Result was ('to', 2)
Processing data: or ...
Finished job # 2
Result was ('or', 1)
Processing data: be ...
Finished job # 3
Result was ('be', 2)
Waiting for jobs to finish..
All done.
[('not', 1), ('to', 2), ('or', 1), ('be', 2)]
Parallelising the reduce step
- Provided our operator is associative and commutative, we can also parallelise the reduce operation.
- Partition the data into roughly equal-sized subsets.
- Reduce each subset independently on a separate core.
- Finally, combine the results of the partial reductions.
Partitioning the data
def split_data(data, split_points):
    partitions = []
    n = 0
    for i in split_points:
        partitions.append(data[n:i])
        n = i
    partitions.append(data[n:])
    return partitions

data = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
partitioned_data = split_data(data, [3])
partitioned_data
[['a', 'b', 'c'], ['d', 'e', 'f', 'g']]
Reducing across partitions in parallel
from threading import Thread

def parallel_reduce(f, partitions):
    n = len(partitions)
    results = [None] * n
    threads = [None] * n
    def job(i):
        results[i] = reduce(f, partitions[i])
    for i in range(n):
        # Pass i via args to avoid the late-binding pitfall of a lambda in a loop
        threads[i] = Thread(target=job, args=(i,))
        threads[i].start()
    for i in range(n):
        threads[i].join()
    return reduce(f, results)
parallel_reduce(lambda x, y: x + y, partitioned_data)
'abcdefg'
Apache Spark and Map-Reduce
- We can map an RDD to a new RDD using higher-order functions.
- Every RDD instance has at least the two methods corresponding to the Map-Reduce workflow: map and reduceByKey.
- These methods work in the same way as the ones we defined earlier for standard Python collections.
- The Apache Spark API provides many additional RDD methods.
Word-count in Apache Spark
words = "to be or not to be".split()
The SparkContext class
- When we use Spark we first need to initialise a SparkContext (a minimal creation sketch is shown below).
- The parallelize method of a SparkContext turns any Python collection into an RDD.
- More typically, we would create an RDD from a large file or an HBase table.
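A minimal sketch of creating a context. In the pyspark shell and many notebook setups an sc object is already created for you, which is what the examples below assume; the application name here is illustrative.

from pyspark import SparkContext

sc = SparkContext(appName="wordcount")   # appName is an illustrative choice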
words_rdd = sc.parallelize(words)
words_rdd
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:423
Mapping an RDD
Now, when we invoke map or reduceByKey on words_rdd, a parallel computation is set up that can run on the cluster.

word_tuples_rdd = words_rdd.map(lambda x: (x, 1))
word_tuples_rdd
PythonRDD[1] at RDD at PythonRDD.scala:43
- Note that no result has been produced yet.
- No computation is performed until we request the final result to be collected.
- We trigger the computation with the collect() method.
word_tuples_rdd.collect()
[('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
Reducing an RDD
- However, we still need some additional processing to aggregate the counts.
word_counts_rdd = word_tuples_rdd.reduceByKey(lambda x, y: x + y)
word_counts_rdd
PythonRDD[6] at RDD at PythonRDD.scala:43
- Now we request the final result.
word_counts = word_counts_rdd.collect()
word_counts
[('not', 1), ('to', 2), ('or', 1), ('be', 2)]
Lazy evaluation
- The cluster only performs the computation when we call collect().
- collect() triggers both the map and the reduceByKey operations.
- If the resulting collection is very large, this can be an expensive operation.
The head of an RDD
- The take method is similar to collect, but returns only the first n elements.
- take is very useful for testing.
word_counts_rdd.take(2)
[('not', 1), ('to', 2)]
The complete word-count example
text = "to be or not to be".split()
rdd = sc.parallelize(text)
counts = rdd.map(lambda word: (word, 1)) \
.reduceByKey(lambda x, y: x + y)
counts.collect()
[('not', 1), ('to', 2), ('or', 1), ('be', 2)]
Additional RDD transformations
Spark provides many additional operations on collections (a few are sketched after this list):
- Sorting: sortByKey, sortBy, takeOrdered
- Mapping: flatMap
- Filtering: filter
- Counting: count
- Set-theoretic: intersection, union
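For example, a few of these could be applied to the counts RDD from the word-count example above (a brief sketch, continuing the same session):

counts.filter(lambda kv: kv[1] > 1).collect()              # words that occur more than once
counts.sortBy(lambda kv: kv[1], ascending=False).take(2)   # the two most frequent words
counts.count()                                             # number of distinct words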
Creating an RDD from a text file
- In the example above we created an RDD from a Python collection object.
- This is not the typical way of processing big data.
- More commonly, an RDD is created directly from an HDFS file or an HBase table.
- The following example creates an RDD from a plain text file on the local (ext4) filesystem.
- Each element of the resulting RDD corresponds to a single line of the text file.
genome = sc.textFile('/tmp/genome.txt')
Genome example
- We will perform a computation over this RDD and sort the sequences by their frequency.
- First we define a function that splits a sequence into chunks of a given size.
def group_characters(line, n=5):
    result = ''
    i = 0
    for ch in line:
        result = result + ch
        i = i + 1
        if (i % n) == 0:
            yield result
            result = ''

def group_and_split(line):
    return [sequence for sequence in group_characters(line)]

group_and_split('abcdefghijklmno')
['abcde', 'fghij', 'klmno']
- Now we want to transform the original RDD into key-value pairs, where the key is a sequence and the value is 1.
- Note that if we simply map each line, we obtain nested (higher-dimensional) data:
genome.map(group_and_split).take(2)
[[u'CAGGG',
u'GCACA',
u'GTCTC',
u'GGCTC',
u'ACTTC',
u'GACCT',
u'CTGCC',
u'TCCCC',
u'AGTTC',
u'AAGTG',
u'ATTCT',
u'CCTGC',
u'CTCAG',
u'TCTCC'],
[u'TGAGT',
u'AGCTG',
u'GGATG',
u'ACAGG',
u'AGTGG',
u'AGCAT',
u'GCCTA',
u'GCTAA',
u'TCTTT',
u'GTATT',
u'TCTAG',
u'TAGAG',
u'ATGCG',
u'GTTTT']]
Flattening an RDD using flatMap
- We need the data as a flat collection of sequences, so we use the flatMap method.
sequences = genome.flatMap(group_and_split)
sequences.take(3)
[u'CAGGG', u'GCACA', u'GTCTC']
counts = \
    sequences.map(
        lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
counts.take(10)
[(u'TGTCA', 1),
(u'GCCCA', 3),
(u'CCAAG', 5),
(u'GCCCC', 4),
(u'CATGT', 1),
(u'AGATT', 1),
(u'TGTTT', 1),
(u'CCTAT', 4),
(u'TCAGT', 1),
(u'CAGCG', 2)]
- We want to sort the sequences by their counts.
- Therefore the key (the first element of each pair) should be the count.
- So we need to reverse the order of each tuple.
def reverse_tuple(key_value_pair):
    return (key_value_pair[1], key_value_pair[0])

sequences = counts.map(reverse_tuple)
sequences.take(10)
[(1, u'TGTCA'),
(3, u'GCCCA'),
(5, u'CCAAG'),
(4, u'GCCCC'),
(1, u'CATGT'),
(1, u'AGATT'),
(1, u'TGTTT'),
(4, u'CCTAT'),
(1, u'TCAGT'),
(2, u'CAGCG')]
Sorting an RDD
- Now we can sort the RDD by key in descending order (an alternative using takeOrdered is sketched after the output).
sequences_sorted = sequences.sortByKey(False)
top_ten_sequences = sequences_sorted.take(10)
top_ten_sequences
[(15, u'AAAAA'),
(9, u'GCAGG'),
(8, u'ACAAA'),
(7, u'GGCCA'),
(7, u'AATTA'),
(7, u'AGGTT'),
(7, u'AGGGA'),
(7, u'CCAGG'),
(7, u'GAGCC'),
(7, u'AAAAC')]
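As an aside, the same top-ten result could be obtained without the explicit tuple reversal by using takeOrdered with a key function (a brief sketch, assuming the counts RDD defined above):

counts.takeOrdered(10, key=lambda kv: -kv[1])   # returns (sequence, count) pairs, largest counts first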