Spark notes: Execution plans, Lazy Evaluation, and caching

Task: calculate the sum of squares:

The standard (or busy) way to do this is:

  1. Calculate the square of each element.
  2. Sum the squares.

This requires storing all intermediate results.

 

The lazy evaluation way:

  • Postpone computing the square until the result is needed.
  • No need to store intermediate results.
  • Scan through the data once, rather than twice (a plain-Python sketch follows this list).
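
To make the contrast concrete, here is a plain-Python sketch (not Spark) of the two approaches; the function names are illustrative, not part of the original notebook:

def eager_sum_of_squares(data):
    squares = [x * x for x in data]   # materializes the full list of squares
    return sum(squares)               # second pass, over the stored list

def lazy_sum_of_squares(data):
    # generator expression: each square is computed on demand and consumed
    # immediately by sum(), so nothing is stored
    return sum(x * x for x in data)

print(eager_sum_of_squares(range(10)))  # 285
print(lazy_sum_of_squares(range(10)))   # 285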

 

Lazy Evaluation

Unlike a regular Python program, map/reduce commands do not necessarily perform any computation when they are executed. Instead, they construct something called an execution plan. Only when a result is needed does the computation start. This approach is also called lazy execution.

The benefit from lazy execution is in minimizing the number of memory accesses. Consider for example the following map/reduce commands:

A=RDD.map(lambda x:x*x).filter(lambda x: x%2==0)
A.reduce(lambda x,y:x+y)  # sum the elements that were not filtered out

These commands define the following plan. For each number x in the RDD:

  1. Compute the square of x
  2. Filter out x*x whose value is odd.
  3. Sum the elements that were not filtered out.

A naive execution plan is to square all items in the RDD, store the results in a new RDD, then perform a filtering pass, generating a second RDD, and finally perform the summation. Doing this requires iterating through the data three times and creating two interim RDDs. As memory access is the bottleneck in this type of computation, this execution plan is slow.

A better execution plan is to perform all three operations on each element of the RDD in sequence, and then move to the next element. This plan is faster because we iterate through the elements of the RDD only once, and because we don't need to save the intermediate results. We need to maintain only one variable: the partial sum, and as that is a single variable, we can use a CPU register.
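
In plain Python, this pipelined plan corresponds roughly to the following single-pass loop (an illustrative sketch, not what Spark literally executes):

data = range(1000000)
total = 0                 # the only state we keep: the partial sum
for x in data:            # a single pass over the data
    sq = x * x            # 1. square
    if sq % 2 == 0:       # 2. filter: keep only even squares
        total += sq       # 3. reduce: add to the running sum
print(total)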

For more on RDDs and lazy evaluation, see the RDD programming guide in the Spark manual.

Experimenting with Lazy Evaluation

Preparations

In the following cells we create an RDD and define a function which wastes some time and then returns cos(i). We want the function to waste some time so that the time it takes to compute the map operation is significant.

from pyspark import SparkContext
sc = SparkContext(master="local[4]")  # note that we set the number of workers to 4

We create an RDD with one million elements to amplify the effects of lazy evaluation and caching.

RDD=sc.parallelize(range(1000000))
It takes about 0.1-0.5 sec. to create the RDD.
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 604 ms

Define a computation

The role of the function taketime is to consume CPU cycles.

from math import cos
def taketime(i):
    [cos(j) for j in range(100)]  # burn some CPU cycles
    return cos(i)
taketime(1)

 

Time units

  • 1 second = 1000 Milli-second (ms)
  • 1 Millisecond = 1000 Micro-second (μs)
  • 1 Microsecond = 1000 Nano-second (ns)

Clock Rate

One cycle of a 3GHz CPU takes about 1/3 of a nanosecond (0.33 ns).

taketime(1000) takes about 25 μs = 75,000 clock cycles (25 μs × 3×10⁹ cycles/sec = 75,000).
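
To reproduce this estimate on your own machine, you can time taketime directly; a minimal sketch using the standard timeit module (the exact figure will vary):

import timeit
# average wall time of one call to taketime, in seconds
per_call = timeit.timeit(lambda: taketime(1000), number=10000) / 10000
print('taketime(1000) takes about %.1f microseconds' % (per_call * 1e6))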

The map operation.

Interm=RDD.map(lambda x: taketime(x))

How come so fast?

  • We expect this map operation to take 1,000,000 * 25 μs = 25 Seconds.
  • Why did the previous cell take just 29 μs?
  • Because no computation was done
  • The cell defined an execution plan, but did not execute it yet.

Lazy Execution refers to this type of behaviour. The system delays actual computation until the latest possible moment. Instead of computing the content of the RDD, it adds the RDD to the execution plan.

Using Lazy evaluation of a plan has two main advantages relative to immediate execution of each step:

  1. A single pass over the data, rather than multiple passes.
  2. Smaller memory footprint because no intermediate results are saved.

Execution Plans

At this point the variable Interm does not point to an actual data structure. Instead, it points to an execution plan expressed as a dependence graph. The dependence graph defines how the RDDs are computed from each other.

At this point only the first two stages of the plan (the source RDD and the map) have been declared; nothing has actually been computed.
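
One way to inspect this plan from PySpark is RDD.toDebugString(), which returns the lineage (dependence graph) of the RDD; a minimal sketch (the exact formatting, and whether the result comes back as bytes or str, varies with the Spark version):

plan = Interm.toDebugString()
print(plan.decode() if isinstance(plan, bytes) else plan)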

Actual execution

The reduce command needs to produce an actual output, so Spark has to actually execute the map and the reduce. Some real computation needs to be done, which takes about 1-3 seconds (wall time), depending on the machine used and on its load.

print('out=',Interm.reduce(lambda x,y:x+y))
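
The Wall time figures quoted here appear to come from the notebook's %%time magic; outside a notebook, the wall time of the action can be measured directly with the standard time module, for example:

import time
start = time.time()
out = Interm.reduce(lambda x, y: x + y)
print('out=', out, ' wall time: %.2f sec' % (time.time() - start))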

How come so fast? (take 2)

  • We expect this map operation to take 1,000,000 * 25 μs = 25 Seconds.
  • Map+reduce takes only ~4 seconds.
  • Why?
  • Because we have 4 workers, rather than one.
  • Because the measurement of a single iteration of taketime is an overestimate.

Executing a different calculation based on the same plan.

The plan defined by Interm might need to be executed more than once.

Example: compute the number of map outputs that are larger than zero.

print('out=',Interm.filter(lambda x:x>0).count())

The price of not materializing

  • The run-time (3.4 sec) is similar to that of the reduce (4.4 sec).
  • Because the intermediate results in Interm have not been saved in memory (materialized)
  • They need to be recomputed.

The middle block, Map(taketime), is executed twice, once for each final step. (This is where the caching described below comes in.)

Caching intermediate results

  • We sometimes want to keep the intermediate results in memory so that we can reuse them later without recalculating. This will reduce the running time, at the cost of requiring more memory.
  • The method cache() indicates that the RDD generated in this plan should be stored in memory. Note that this is a plan to cache. The actual caching will be done only when the final result is needed.
Interm=RDD.map(lambda x: taketime(x)).cache()

By adding the Cache after Map(Taketime), we save the results of the map for the second computation.

Plan to cache

The definition of Interm is almost the same as before. However, the plan corresponding to Interm is more elaborate and contains information about how the intermediate results will be cached and replicated.

Note that PythonRDD[4] is now [Memory Serialized 1x Replicated]

 

Creating the cache

The following command executes the first map-reduce command and caches the result of the map command in memory.

print('out=',Interm.reduce(lambda x,y:x+y))
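
Once cache() has been requested, the RDD also reports its storage level; a minimal sketch using the public PySpark attributes is_cached and getStorageLevel():

print(Interm.is_cached)          # True, since cache() was called on this RDD
print(Interm.getStorageLevel())  # e.g. Memory Serialized 1x Replicated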

Using the cache

This time Interm is cached. Therefore the second use of Interm is much faster than when we did not use cache: 0.25 seconds instead of 1.9 seconds. (Your mileage may vary depending on the computer you are running this on.)

print('out=',Interm.filter(lambda x:x>0).count())

Summary

  • Spark uses Lazy Evaluation to save time and space.
  • When the same RDD is needed as input for several computations, it can be better to keep it in memory using cache().
  • Next video: Partitioning and Gloming

 

Partitioning and Gloming

  • When an RDD is created, you can specify the number of partitions.
  • The default is the number of workers defined when you set up the SparkContext.
A=sc.parallelize(range(1000000))
print(A.getNumPartitions())

4

We can repartition A into a different number of partitions.

D=A.repartition(10)
print(D.getNumPartitions())

10

We can also define the number of partitions when creating the RDD.

A=sc.parallelize(range(1000000),numSlices=10)
print(A.getNumPartitions())

10

Why is the #Partitions important?

  • They define the unit the executor works on (see the mapPartitions() sketch after this list).
  • You should have at least as many partitions as workers.
  • Smaller partitions can allow more parallelization.
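
One way to see that partitions are the unit of work is to compute one partial result per partition with mapPartitions(); a minimal sketch (the variable names are illustrative):

P = sc.parallelize(range(1000000), 8)             # an RDD with 8 partitions
partials = P.mapPartitions(lambda it: [sum(it)])  # one partial sum per partition
print(partials.collect())                         # 8 numbers, one per partition
print(partials.reduce(lambda x, y: x + y))        # 499999500000, the grand total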

 

Repartitioning for Load Balancing

Suppose we start with 10 partitions, all with exactly the same number of elements

A=sc.parallelize(range(1000000))\
    .map(lambda x:(x,x)).partitionBy(10)
print(A.glom().map(len).collect())

  • Suppose we want to use filter() to select some of the elements in A.
  • Some partitions might have more elements remaining than others.
# select 20% of the entries (those whose key is divisible by 5)
B=A.filter(lambda pair: pair[0]%5==0)
# print the number of elements in each partition
print(B.glom().map(len).collect())

  • Future operations on B will use only two workers.
  • The other workers will do nothing,
    because their partitions are empty.
  • To fix the situation we need to repartition the RDD. 
  • One way to do that is to repartition using a new key.
  • The method .partitionBy(k) expects to get a (key,value) RDD where keys are integers. 
  • Partitions the RDD into k partitions.
  • The element (key,value) is placed into partition no.  key % k
C=B.map(lambda pair:(pair[1]//10,pair[1])).partitionBy(10)  # integer-divide so the new key is an integer
print(C.glom().map(len).collect())

Another approach is to use random partitioning using repartition(k)

  • An advantage of random partitioning is that it does not require defining a key.
  • A disadvantage of random partitioning is that you have no control over the partitioning.
C=B.repartition(10)
print(C.glom().map(len).collect())

Glom()

  • In general, spark does not allow the worker to refer to specific elements of the RDD.
  • Keeps the language clean, but can be a major limitation.
  • glom() transforms each partition into a tuple (an immutable list) of elements.
  • It creates an RDD of tuples, one tuple per partition (a small concrete example follows the command below).
  • workers can refer to elements of the partition by index. 
  • but you cannot assign values to the elements, the RDD is still immutable.
  • Now we can understand the command used above to count the number of elements in each partition.
  • We use glom() to make each partition into a tuple.
  • We use len on each partition to get the length of the tuple - size of the partition.
  • We collect the results to print them out.
print(C.glom().map(len).collect())
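
For a more concrete view of what glom() returns, here is a tiny example on a small RDD (how the elements are split across partitions is up to Spark, so the exact output may differ):

small = sc.parallelize(range(10), 3)    # a 10-element RDD in 3 partitions
print(small.glom().collect())           # e.g. [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
print(small.glom().map(len).collect())  # e.g. [3, 3, 4]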

A more elaborate example

There are many things that you can do using glom()

Below is an example, can you figure out what it does?

def getPartitionInfo(G):
    d=0
    if len(G)>1: 
        for i in range(len(G)-1):
            d+=abs(G[i+1][1]-G[i][1]) # access the glommed partition, which is now a list
        return (G[0][0],len(G),d)
    else:
        return None

output=B.glom().map(getPartitionInfo).collect()
print(output)

Summary

  • We learned why partitions are important and how to control them.
  • We learned how glom() can be used to allow workers to access their partitions as lists.

 

 
