Chaining
We can chain transformations and actions to create a computation pipeline.
Suppose we want to compute the sum of the squares ∑ᵢ xᵢ²
where the elements xᵢ are stored in an RDD.
#start the SparkContext
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext(master="local[4]")
print(sc)
<SparkContext master=local[4] appName=pyspark-shell>
Create an RDD
B=sc.parallelize(range(4))
B.collect()
[0, 1, 2, 3]
Sequential syntax for chaining
Perform assignment after each computation
Squares=B.map(lambda x:x*x)
Squares.reduce(lambda x,y:x+y)
14
Cascaded syntax for chaining
Combine computations into a single cascaded command
B.map(lambda x:x*x).reduce(lambda x,y:x+y)
14
Both syntaxes mean exactly the same thing
The only difference:
- In the sequential syntax the intermediate RDD has a name (Squares).
- In the cascaded syntax the intermediate RDD is anonymous.
The execution is identical!
Sequential execution
The standard way that the map and reduce are executed is
- perform the map
- store the resulting RDD in memory
- perform the reduce
Disadvantages of Sequential execution
- The intermediate result (Squares) requires memory space.
- Two scans of memory (first of B, then of Squares) - double the cache-misses.
Pipelined execution
Perform the whole computation in a single pass. For each element of B:
- Compute the square
- Enter the square as input to the reduce operation.
Advantages of Pipelined execution
- Less memory required - intermediate result is not stored.
- Faster - only one pass through the Input RDD.
Lazy Evaluation
This type of pipelined evaluation is related to Lazy Evaluation. The word Lazy is used because the first command (computing the square) is not executed immediately. Instead, the execution is delayed as long as possible so that several commands are executed in a single pass.
The delayed commands are organized in an Execution plan
For more on pipelined execution, lazy evaluation and execution plans, see the Spark programming guide / RDD operations.
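Plain-Python generators give a rough analogue of this behaviour (a sketch of the idea, not Spark's actual machinery):

```python
from functools import reduce

# A generator expression is lazy: no squares are computed yet.
data = range(4)
squares = (x * x for x in data)   # analogous to B.map(lambda x: x*x)

# Only when reduce consumes the generator does each square get
# computed and immediately fed into the running sum -- one pass,
# no intermediate list stored in memory.
total = reduce(lambda x, y: x + y, squares)
print(total)  # 14
```

As in Spark, the "transformation" line above does no work by itself; the work happens only when the consuming "action" (reduce) runs.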
An instructive mistake
Here is another way to compute the sum of the squares using a single reduce command. Can you figure out how it comes up with this unexpected result?
C=sc.parallelize([1,1,2])
C.reduce(lambda x,y: x*x+y*y)
8
Answer:
- reduce first operates on the pair (1,1), replacing it with 1² + 1² = 2
- reduce then operates on the pair (2,2), giving the final result 2² + 2² = 8
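The same left-to-right fold can be reproduced with Python's functools.reduce, which makes the order of operations explicit (a plain-Python analogue; Spark's reduce additionally combines partial results across partitions):

```python
from functools import reduce

# reduce folds left-to-right: f(f(1,1), 2)
result = reduce(lambda x, y: x * x + y * y, [1, 1, 2])
print(result)  # 8, not the sum of squares (which is 6)

# The lambda is not associative and does not keep x as a running
# sum of squares, so the result depends on grouping. The safe
# pattern is: square in a map, then add in the reduce.
correct = reduce(lambda x, y: x + y, map(lambda x: x * x, [1, 1, 2]))
print(correct)  # 6
```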
getting information about an RDD
RDDs typically have hundreds of thousands of elements. It usually makes no sense to print out the content of a whole RDD. Here are some ways to get manageable amounts of information about an RDD.
Create an RDD of length n
which is a repetition of the pattern 1,2,3,4
n=1000000
B=sc.parallelize([1,2,3,4]*int(n/4))
#find the number of elements in the RDD
B.count()
1000000
# get the first few elements of an RDD
print('first element=',B.first())
print('first 5 elements = ',B.take(5))
first element= 1
first 5 elements = [1, 2, 3, 4, 1]
Sampling an RDD
- RDDs are often very large.
- Aggregates, such as averages, can be approximated efficiently by using a sample.
- Sampling is done in parallel and requires limited computation.
The method RDD.sample(withReplacement, p)
generates a sample of the elements of the RDD, where
- withReplacement is a boolean flag indicating whether or not an element in the RDD can be sampled more than once.
- p is the probability of accepting each element into the sample.
Note that as the sampling is performed independently in each partition, the number of elements in the sample changes from sample to sample.
# get a sample whose expected size is m
# Note that the size of the sample is different in different runs
m=5.
print('sample1=',B.sample(False,m/n).collect())
print('sample2=',B.sample(False,m/n).collect())
sample1= [4, 4, 4]
sample2= [2, 2]
Things to note and think about
- Each time you run the previous cell, you get a different estimate
- The accuracy of the estimate is determined by the size of the sample n∗p
- See how the error changes as you vary p
- Can you give a formula that relates the variance of the estimate to (p∗n) ? (The answer is in the Probability and statistics course).
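The effect of p on the accuracy can be explored with a small plain-Python simulation of independent Bernoulli sampling (a sketch of the sampling idea, not Spark code; the repeated pattern mimics the RDD B above):

```python
import random

random.seed(0)
data = [1, 2, 3, 4] * 25000          # repeating pattern, n = 100,000
true_mean = sum(data) / len(data)    # 2.5

def sample_mean(p):
    """Keep each element independently with probability p,
    then estimate the mean of the data from the sample."""
    sample = [x for x in data if random.random() < p]
    return sum(sample) / len(sample)

# Larger p -> larger expected sample size n*p -> less spread
# between repeated estimates.
for p in (0.001, 0.01, 0.1):
    estimates = [sample_mean(p) for _ in range(5)]
    print(p, [round(e, 3) for e in estimates])
```

Running the cell several times shows the estimates clustering more tightly around the true mean as p grows.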
filtering an RDD
The method RDD.filter(func)
returns a new dataset formed by selecting those elements of the source on which func returns true.
print('the number of elements in B that are > 3 =',B.filter(lambda n: n > 3).count())
the number of elements in B that are > 3 = 250000
Removing duplicate elements from an RDD
The method RDD.distinct()
returns a new dataset that contains the distinct elements of the source dataset.
This operation requires a shuffle in order to detect duplication across partitions.
# Remove duplicate element in DuplicateRDD, we get distinct RDD
DuplicateRDD = sc.parallelize([1,1,2,2,3,3])
print('DuplicateRDD=',DuplicateRDD.collect())
print('DistinctRDD = ',DuplicateRDD.distinct().collect())
DuplicateRDD= [1, 1, 2, 2, 3, 3]
DistinctRDD = [1, 2, 3]
flatMap an RDD
The method RDD.flatMap(func)
is similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
text=["you are my sunshine","my only sunshine"]
text_file = sc.parallelize(text)
# map each line in text to a list of words
print('map:',text_file.map(lambda line: line.split(" ")).collect())
# create a single list of words by combining the words from all of the lines
print('flatmap:',text_file.flatMap(lambda line: line.split(" ")).collect())
map: [['you', 'are', 'my', 'sunshine'], ['my', 'only', 'sunshine']]
flatmap: ['you', 'are', 'my', 'sunshine', 'my', 'only', 'sunshine']
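The difference can be mimicked in plain Python with list comprehensions (an analogue of the two RDD operations above, not Spark code):

```python
text = ["you are my sunshine", "my only sunshine"]

# map: one output item per input item -> a list of lists
mapped = [line.split(" ") for line in text]
print(mapped)  # [['you', 'are', 'my', 'sunshine'], ['my', 'only', 'sunshine']]

# flatMap: each input item may yield several output items,
# and the results are concatenated into a single flat list
flat = [word for line in text for word in line.split(" ")]
print(flat)    # ['you', 'are', 'my', 'sunshine', 'my', 'only', 'sunshine']
```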
rdd1 = sc.parallelize([1, 1, 2, 3])
rdd2 = sc.parallelize([1, 3, 4, 5])
- union(other)
- Return the union of this RDD and another one.
- Note that repetitions are allowed. The RDDs are bags, not sets.
- To make the result a set, use .distinct()
rdd2=sc.parallelize(['a','b',1])
print('rdd1=',rdd1.collect())
print('rdd2=',rdd2.collect())
print('union as bags =',rdd1.union(rdd2).collect())
print('union as sets =',rdd1.union(rdd2).distinct().collect())
rdd1= [1, 1, 2, 3]
rdd2= ['a', 'b', 1]
union as bags = [1, 1, 2, 3, 'a', 'b', 1]
union as sets = [1, 'a', 2, 3, 'b']
- intersection(other)
- Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. Note that this method performs a shuffle internally.
rdd2=sc.parallelize([1,1,2,5])
print('rdd1=',rdd1.collect())
print('rdd2=',rdd2.collect())
print('intersection=',rdd1.intersection(rdd2).collect())
rdd1= [1, 1, 2, 3]
rdd2= [1, 1, 2, 5]
intersection= [1, 2]
- subtract(other, numPartitions=None)
- Return each value in self that is not contained in other.
print('rdd1=',rdd1.collect())
print('rdd2=',rdd2.collect())
print('rdd1.subtract(rdd2)=',rdd1.subtract(rdd2).collect())
rdd1= [1, 1, 2, 3]
rdd2= [1, 1, 2, 5]
rdd1.subtract(rdd2)= [3]
- cartesian(other)
- Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.
rdd2=sc.parallelize([1,1,2])
rdd2=sc.parallelize(['a','b'])
print('rdd1=',rdd1.collect())
print('rdd2=',rdd2.collect())
print('rdd1.cartesian(rdd2)=\n',rdd1.cartesian(rdd2).collect())
rdd1= [1, 1, 2, 3]
rdd2= ['a', 'b']
rdd1.cartesian(rdd2)=
[(1, 'a'), (1, 'b'), (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'b')]
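In plain Python, cartesian corresponds to itertools.product over the two collections (an analogue, not Spark code; pair order within partitions may differ in Spark):

```python
from itertools import product

rdd1_local = [1, 1, 2, 3]
rdd2_local = ['a', 'b']

# Every pair (a, b) with a from the first list and b from the second;
# duplicates in the inputs produce duplicate pairs, as with RDDs.
pairs = list(product(rdd1_local, rdd2_local))
print(pairs)
```

The result has len(rdd1_local) * len(rdd2_local) pairs, matching the Spark output above.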
Summary
- Chaining: creating a pipeline of RDD operations.
- Counting, taking and sampling an RDD
- More transformations: filter, distinct, flatMap
- Set transformations: union, intersection, subtract, cartesian