Chaining
We can chain transformations and actions to create a computation pipeline.
Suppose we want to compute the sum of the squares ∑ᵢ xᵢ²
where the elements xᵢ are stored in an RDD.
#start the SparkContext
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext(master="local[4]")
print(sc)
<SparkContext master=local[4] appName=pyspark-shell>
Create an RDD
B=sc.parallelize(range(4))
B.collect()
[0, 1, 2, 3]
Sequential syntax for chaining
Perform assignment after each computation
Squares=B.map(lambda x:x*x)
Squares.reduce(lambda x,y:x+y)
14
Cascaded syntax for chaining
Combine computations into a single cascaded command
B.map(lambda x:x*x).reduce(lambda x,y:x+y)
14
Both syntaxes mean exactly the same thing
The only difference:
- In the sequential syntax the intermediate RDD has a name (Squares).
- In the cascaded syntax the intermediate RDD is anonymous.
The execution is identical!
Sequential execution
The standard way that the map and reduce are executed is
- perform the map
- store the resulting RDD in memory
- perform the reduce
Disadvantages of Sequential execution
- The intermediate result (Squares) requires memory space.
- Two scans of memory (first of B, then of Squares) - double the cache-misses.
Pipelined execution
Perform the whole computation in a single pass. For each element of B:
- Compute the square
- Enter the square as input to the reduce operation.
Advantages of Pipelined execution
- Less memory required - intermediate result is not stored.
- Faster - only one pass through the Input RDD.
Lazy Evaluation
This type of pipelined evaluation is related to Lazy Evaluation. The word Lazy is used because the first command (computing the square) is not executed immediately. Instead, the execution is delayed as long as possible so that several commands are executed in a single pass.
The delayed commands are organized in an Execution plan
For more on pipelined execution, lazy evaluation and execution plans, see the Spark programming guide / RDD operations.
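Plain-Python generators give a rough analogue of this behaviour (a sketch of the idea, not Spark's actual machinery):

```python
from functools import reduce

# A generator expression is lazy: no squares are computed yet.
data = range(4)
squares = (x * x for x in data)   # analogous to B.map(lambda x: x*x)

# Only when reduce consumes the generator does each square get
# computed and immediately fed into the running sum -- one pass,
# no intermediate list stored in memory.
total = reduce(lambda x, y: x + y, squares)
print(total)  # 14
```

As in Spark, the "transformation" line above does no work by itself; the work happens only when the consuming "action" (reduce) runs.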
An instructive mistake
Here is another way to compute the sum of the squares using a single reduce command. Can you figure out how it comes up with this unexpected result?
C=sc.parallelize([1,1,2])
C.reduce(lambda x,y: x*x+y*y)
8
Answer:
- reduce first operates on the pair (1,1), replacing it with 1² + 1² = 2
- reduce then operates on the pair (2,2), giving the final result 2² + 2² = 8
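The same left-to-right fold can be reproduced with Python's functools.reduce, which makes the order of operations explicit (a plain-Python analogue; Spark's reduce additionally combines partial results across partitions):

```python
from functools import reduce

# reduce folds left-to-right: f(f(1,1), 2)
result = reduce(lambda x, y: x * x + y * y, [1, 1, 2])
print(result)  # 8, not the sum of squares (which is 6)

# The lambda is not associative and does not keep x as a running
# sum of squares, so the result depends on grouping. The safe
# pattern is: square in a map, then add in the reduce.
correct = reduce(lambda x, y: x + y, map(lambda x: x * x, [1, 1, 2]))
print(correct)  # 6
```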
getting information about an RDD
RDDs typically have hundreds of thousands of elements. It usually makes no sense to print out the content of a whole RDD. Here are some ways to get manageable amounts of information about an RDD.
Create an RDD of length n
which is a repetition of the pattern 1,2,3,4
n=1000000
B=sc.parallelize([1,2,3,4]*int(n/4))
#find the number of elements in the RDD
B.count()
1000000
# get the first few elements of an RDD
print('first element=',B.first())
print('first 5 elements = ',B.take(5))
first element= 1
first 5 elements = [1, 2, 3, 4, 1]
Sampling an RDD
- RDDs are often very large.
- Aggregates, such as averages, can be approximated efficiently by using a sample.
- Sampling is done in parallel and requires limited computation.
The method RDD.sample(withReplacement, p)
generates a sample of the elements of the RDD, where
- withReplacement is a boolean flag indicating whether or not an element in the RDD can be sampled more than once.
- p is the probability of accepting each element into the sample.
Note that as the sampling is performed independently in each partition, the number of elements in the sample changes from sample to sample.
# get a sample whose expected size is m
# Note that the size of the sample is different in different runs
m=5.
print('sample1=',B.sample(False,m/n).collect())
print('sample2=',B.sample(False,m/n).collect())
sample1= [4, 4, 4]
sample2= [2, 2]
Things to note and think about
- Each time you run the previous cell, you get a different estimate
- The accuracy of the estimate is determined by the size of the sample n∗p
- See how the error changes as you vary p
- Can you give a formula that relates the variance of the estimate to (p∗n) ? (The answer is in the Probability and statistics course).
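The effect of p on the accuracy can be explored with a small plain-Python simulation of independent Bernoulli sampling (a sketch of the sampling idea, not Spark code; the repeated pattern mimics the RDD B above):

```python
import random

random.seed(0)
data = [1, 2, 3, 4] * 25000          # repeating pattern, n = 100,000
true_mean = sum(data) / len(data)    # 2.5

def sample_mean(p):
    """Keep each element independently with probability p,
    then estimate the mean of the data from the sample."""
    sample = [x for x in data if random.random() < p]
    return sum(sample) / len(sample)

# Larger p -> larger expected sample size n*p -> less spread
# between repeated estimates.
for p in (0.001, 0.01, 0.1):
    estimates = [sample_mean(p) for _ in range(5)]
    print(p, [round(e, 3) for e in estimates])
```

Running the cell several times shows the estimates clustering more tightly around the true mean as p grows.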
filtering an RDD
The method RDD.filter(func)
returns a new dataset formed by selecting those elements of the source on which func returns true.
print('the number of elements in B that are > 3 =',B.filter(lambda n: n > 3).count())
the number of elements in B that are > 3 = 250000
Removing duplicate elements from an RDD
The method RDD.distinct()
returns a new dataset that contains the distinct elements of the source dataset.
This operation requires a shuffle in order to detect duplication across partitions.
# Remove duplicate element in DuplicateRDD, we get distinct RDD
DuplicateRDD = sc.parallelize([1,1,2,2,3,3])
print('DuplicateRDD=',DuplicateRDD.collect())
print('DistinctRDD = ',DuplicateRDD.distinct().collect())
DuplicateRDD= [1, 1, 2, 2, 3, 3]
DistinctRDD = [1, 2, 3]
flatMap an RDD
The method RDD.flatMap(func)
is similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
text=["you are my sunshine","my only sunshine"]
text_file = sc.parallelize(text)
# map each line in text to a list of words
print('map:',text_file.map(lambda line: line.split(" ")).collect())
# create a single list of words by combining the words from all of the lines
print('flatmap:',text_file.flatMap(lambda line: line.split(" ")).collect())
map: [['you', 'are', 'my', 'sunshine'], ['my', 'only', 'sunshine']]
flatmap: ['you', 'are', 'my', 'sunshine', 'my', 'only', 'sunshine']
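The difference can be mimicked in plain Python with list comprehensions (an analogue of the two RDD operations above, not Spark code):

```python
text = ["you are my sunshine", "my only sunshine"]

# map: one output item per input item -> a list of lists
mapped = [line.split(" ") for line in text]
print(mapped)  # [['you', 'are', 'my', 'sunshine'], ['my', 'only', 'sunshine']]

# flatMap: each input item may yield several output items,
# and the results are concatenated into a single flat list
flat = [word for line in text for word in line.split(" ")]
print(flat)    # ['you', 'are', 'my', 'sunshine', 'my', 'only', 'sunshine']
```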
rdd1 = sc.parallelize([1, 1, 2, 3])
rdd2 = sc.parallelize([1, 3, 4, 5])
- union(other)
- Return the union of this RDD and another one.
- Note that repetitions are allowed. The RDDs are bags, not sets.
- To make the result a set, use .distinct()
rdd2=sc.parallelize(['a','b',1])
print('rdd1=',rdd1.collect())
print('rdd2=',rdd2.collect())
print('union as bags =',rdd1.union(rdd2).collect())
print('union as sets =',rdd1.union(rdd2).distinct().collect())
rdd1= [1, 1, 2, 3]
rdd2= ['a', 'b', 1]
union as bags = [1, 1, 2, 3, 'a', 'b', 1]
union as sets = [1, 'a', 2, 3, 'b']
- intersection(other)
- Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. Note that this method performs a shuffle internally.
rdd2=sc.parallelize([1,1,2,5])
print('rdd1=',rdd1.collect())
print('rdd2=',rdd2.collect())
print('intersection=',rdd1.intersection(rdd2).collect())
rdd1= [1, 1, 2, 3]
rdd2= [1, 1, 2, 5]
intersection= [1, 2]
- subtract(other, numPartitions=None)
- Return each value in self that is not contained in other.
print('rdd1=',rdd1.collect())
print('rdd2=',rdd2.collect())
print('rdd1.subtract(rdd2)=',rdd1.subtract(rdd2).collect())
rdd1= [1, 1, 2, 3]
rdd2= [1, 1, 2, 5]
rdd1.subtract(rdd2)= [3]
- cartesian(other)
- Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.
rdd2=sc.parallelize([1,1,2])
rdd2=sc.parallelize(['a','b'])
print('rdd1=',rdd1.collect())
print('rdd2=',rdd2.collect())
print('rdd1.cartesian(rdd2)=\n',rdd1.cartesian(rdd2).collect())
rdd1= [1, 1, 2, 3]
rdd2= ['a', 'b']
rdd1.cartesian(rdd2)=
[(1, 'a'), (1, 'b'), (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'b')]
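In plain Python, cartesian corresponds to itertools.product over the two collections (an analogue, not Spark code; pair order within partitions may differ in Spark):

```python
from itertools import product

rdd1_local = [1, 1, 2, 3]
rdd2_local = ['a', 'b']

# Every pair (a, b) with a from the first list and b from the second;
# duplicates in the inputs produce duplicate pairs, as with RDDs.
pairs = list(product(rdd1_local, rdd2_local))
print(pairs)
```

The result has len(rdd1_local) * len(rdd2_local) pairs, matching the Spark output above.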
Summary
- Chaining: creating a pipeline of RDD operations.
- Counting, taking and sampling an RDD
- More transformations: filter, distinct, flatMap
- Set transformations: union, intersection, subtract, cartesian