Spark notes: Execution plans, Lazy Evaluation, and caching

Task: calculate the sum of squares:

The standard (or busy) way to do this is:

  1. Calculate the square of each element.
  2. Sum the squares.

This requires storing all intermediate results.

 

The lazy evaluation way:

  • Postpone computing the square until the result is needed.
  • No need to store intermediate results.
  • Scan through the data once, rather than twice (a plain-Python sketch follows this list).
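
To make the contrast concrete, here is a plain-Python sketch (not Spark) of the two approaches; the function names are illustrative, not part of the original notebook:

def eager_sum_of_squares(data):
    squares = [x * x for x in data]   # materializes the full list of squares
    return sum(squares)               # second pass, over the stored list

def lazy_sum_of_squares(data):
    # generator expression: each square is computed on demand and consumed
    # immediately by sum(), so nothing is stored
    return sum(x * x for x in data)

print(eager_sum_of_squares(range(10)))  # 285
print(lazy_sum_of_squares(range(10)))   # 285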

 

Lazy Evaluation

Unlike a regular Python program, map/reduce commands do not necessarily perform any computation when they are executed. Instead, they construct something called an execution plan. Only when a result is needed does the computation start. This approach is also called lazy execution.

The benefit from lazy execution is in minimizing the number of memory accesses. Consider for example the following map/reduce commands:

A=RDD.map(lambda x:x*x).filter(lambda x: x%2==0)
A.reduce(lambda x,y:x+y)  # sum the elements that were not filtered out

These commands define the following plan. For each number x in the RDD:

  1. Compute the square of x
  2. Filter out x*x whose value is odd.
  3. Sum the elements that were not filtered out.

A naive execution plan is to square all items in the RDD, store the results in a new RDD, then perform a filtering pass, generating a second RDD, and finally perform the summation. Doing this requires iterating through the data three times and creating two interim RDDs. As memory access is the bottleneck in this type of computation, this execution plan is slow.

A better execution plan is to perform all three operations on each element of the RDD in sequence, and then move to the next element. This plan is faster because we iterate through the elements of the RDD only once, and because we don't need to save the intermediate results. We need to maintain only one variable: the partial sum, and as that is a single variable, we can use a CPU register.
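
In plain Python, this pipelined plan corresponds roughly to the following single-pass loop (an illustrative sketch, not what Spark literally executes):

data = range(1000000)
total = 0                 # the only state we keep: the partial sum
for x in data:            # a single pass over the data
    sq = x * x            # 1. square
    if sq % 2 == 0:       # 2. filter: keep only even squares
        total += sq       # 3. reduce: add to the running sum
print(total)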

For more on RDDs and lazy evaluation, see the RDD programming guide in the Spark manual.

Experimenting with Lazy Evaluation

Preparations

In the following cells we create an RDD and define a function which wastes some time and then returns cos(i). We want the function to waste some time so that the time it takes to compute the map operation is significant.

from pyspark import SparkContext
sc = SparkContext(master="local[4]")  # note that we set the number of workers to 4

We create an RDD with one million elements to amplify the effects of lazy evaluation and caching.

RDD=sc.parallelize(range(1000000))
It takes about 0.1-0.5 sec. to create the RDD.
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 604 ms

Define a computation

The role of the function taketime is to consume CPU cycles.

from math import cos
def taketime(i):
    [cos(j) for j in range(100)]  # burn some CPU cycles
    return cos(i)
taketime(1)

 

Time units

  • 1 second = 1000 Milli-second (ms)
  • 1 Millisecond = 1000 Micro-second (μs)
  • 1 Microsecond = 1000 Nano-second (ns)

Clock Rate

One cycle of a 3GHz CPU takes about 1/3 of a nanosecond (0.33 ns).

taketime(1000) takes about 25 μs = 75,000 clock cycles (25 μs × 3×10⁹ cycles/sec = 75,000).
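
To reproduce this estimate on your own machine, you can time taketime directly; a minimal sketch using the standard timeit module (the exact figure will vary):

import timeit
# average wall time of one call to taketime, in seconds
per_call = timeit.timeit(lambda: taketime(1000), number=10000) / 10000
print('taketime(1000) takes about %.1f microseconds' % (per_call * 1e6))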

The map operation.

Interm=RDD.map(lambda x: taketime(x))

How come so fast?

  • We expect this map operation to take 1,000,000 * 25 μs = 25 Seconds.
  • Why did the previous cell take just 29 μs?
  • Because no computation was done
  • The cell defined an execution plan, but did not execute it yet.

Lazy Execution refers to this type of behaviour. The system delays actual computation until the latest possible moment. Instead of computing the content of the RDD, it adds the RDD to the execution plan.

Using Lazy evaluation of a plan has two main advantages relative to immediate execution of each step:

  1. A single pass over the data, rather than multiple passes.
  2. Smaller memory footprint because no intermediate results are saved.

Execution Plans

At this point the variable Interm does not point to an actual data structure. Instead, it points to an execution plan expressed as a dependence graph. The dependence graph defines how the RDDs are computed from each other.

At this point only the first two stages of the plan (the source RDD and the map) have been declared; nothing has actually been computed.
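
One way to inspect this plan from PySpark is RDD.toDebugString(), which returns the lineage (dependence graph) of the RDD; a minimal sketch (the exact formatting, and whether the result comes back as bytes or str, varies with the Spark version):

plan = Interm.toDebugString()
print(plan.decode() if isinstance(plan, bytes) else plan)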

Actual execution

The reduce command needs to produce an actual output, so Spark has to actually execute the map and the reduce. Some real computation needs to be done, which takes about 1-3 seconds (wall time), depending on the machine used and on its load.

print('out=',Interm.reduce(lambda x,y:x+y))
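
The Wall time figures quoted here appear to come from the notebook's %%time magic; outside a notebook, the wall time of the action can be measured directly with the standard time module, for example:

import time
start = time.time()
out = Interm.reduce(lambda x, y: x + y)
print('out=', out, ' wall time: %.2f sec' % (time.time() - start))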

How come so fast? (take 2)

  • We expect this map operation to take 1,000,000 * 25 μs = 25 Seconds.
  • Map+reduce takes only ~4 seconds.
  • Why?
  • Because we have 4 workers, rather than one.
  • Because the measurement of a single iteration of taketime is an overestimate.

Executing a different calculation based on the same plan.

The plan defined by Interm might need to be executed more than once.

Example: compute the number of map outputs that are larger than zero.

print('out=',Interm.filter(lambda x:x>0).count())

The price of not materializing

  • The run-time (3.4 sec) is similar to that of the reduce (4.4 sec).
  • Because the intermediate results in Interm have not been saved in memory (materialized)
  • They need to be recomputed.

The middle block, Map(taketime), is executed twice, once for each final step. (This is where the caching described below comes in.)

Caching intermediate results

  • We sometimes want to keep the intermediate results in memory so that we can reuse them later without recalculating. This will reduce the running time, at the cost of requiring more memory.
  • The method cache() indicates that the RDD generated in this plan should be stored in memory. Note that this is a plan to cache. The actual caching will be done only when the final result is needed.
Interm=RDD.map(lambda x: taketime(x)).cache()

By adding the Cache after Map(Taketime), we save the results of the map for the second computation.

Plan to cache

The definition of Interm is almost the same as before. However, the plan corresponding to Interm is more elaborate and contains information about how the intermediate results will be cached and replicated.

Note that PythonRDD[4] is now [Memory Serialized 1x Replicated]

 

Creating the cache

The following command executes the first map-reduce command and caches the result of the map command in memory.

print('out=',Interm.reduce(lambda x,y:x+y))
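
Once cache() has been requested, the RDD also reports its storage level; a minimal sketch using the public PySpark attributes is_cached and getStorageLevel():

print(Interm.is_cached)          # True, since cache() was called on this RDD
print(Interm.getStorageLevel())  # e.g. Memory Serialized 1x Replicated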

Using the cache

This time Interm is cached. Therefore the second use of Interm is much faster than when we did not use cache: 0.25 seconds instead of 1.9 seconds. (Your mileage may vary depending on the computer you are running this on.)

print('out=',Interm.filter(lambda x:x>0).count())

Summary

  • Spark uses Lazy Evaluation to save time and space.
  • When the same RDD is needed as input for several computations, it can be better to keep it in memory using cache().
  • Next video: Partitioning and Gloming

 

Partitioning and Gloming

  • When an RDD is created, you can specify the number of partitions.
  • The default is the number of workers defined when you set up the SparkContext.
A=sc.parallelize(range(1000000))
print(A.getNumPartitions())

4

We can repartition A into a different number of partitions.

D=A.repartition(10)
print(D.getNumPartitions())

10

We can also define the number of partitions when creating the RDD.

A=sc.parallelize(range(1000000),numSlices=10)
print(A.getNumPartitions())

10

Why is the #Partitions important?

  • They define the unit the executor works on (see the mapPartitions() sketch after this list).
  • You should have at least as many partitions as workers.
  • Smaller partitions can allow more parallelization.
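
One way to see that partitions are the unit of work is to compute one partial result per partition with mapPartitions(); a minimal sketch (the variable names are illustrative):

P = sc.parallelize(range(1000000), 8)             # an RDD with 8 partitions
partials = P.mapPartitions(lambda it: [sum(it)])  # one partial sum per partition
print(partials.collect())                         # 8 numbers, one per partition
print(partials.reduce(lambda x, y: x + y))        # 499999500000, the grand total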

 

Repartitioning for Load Balancing

Suppose we start with 10 partitions, all with exactly the same number of elements

A=sc.parallelize(range(1000000))\
    .map(lambda x:(x,x)).partitionBy(10)
print(A.glom().map(len).collect())

  • Suppose we want to use filter() to select some of the elements in A.
  • Some partitions might have more elements remaining than others.
# select 20% of the entries (those whose key is divisible by 5)
B=A.filter(lambda pair: pair[0]%5==0)
# print the number of elements in each partition
print(B.glom().map(len).collect())

  • Future operations on B will use only two workers.
  • The other workers will do nothing,
    because their partitions are empty.
  • To fix the situation we need to repartition the RDD. 
  • One way to do that is to repartition using a new key.
  • The method .partitionBy(k) expects to get a (key,value) RDD where keys are integers. 
  • Partitions the RDD into k partitions.
  • The element (key,value) is placed into partition no.  key % k
C=B.map(lambda pair:(pair[1]//10,pair[1])).partitionBy(10)  # integer-divide so the new key is an integer
print(C.glom().map(len).collect())

Another approach is to use random partitioning using repartition(k)

  • An advantage of random partitioning is that it does not require defining a key.
  • A disadvantage of random partitioning is that you have no control over the partitioning.
C=B.repartition(10)
print(C.glom().map(len).collect())

Glom()

  • In general, spark does not allow the worker to refer to specific elements of the RDD.
  • Keeps the language clean, but can be a major limitation.
  • glom() transforms each partition into a tuple (an immutable list) of elements.
  • It creates an RDD of tuples, one tuple per partition (a small concrete example follows the command below).
  • workers can refer to elements of the partition by index. 
  • but you cannot assign values to the elements, the RDD is still immutable.
  • Now we can understand the command used above to count the number of elements in each partition.
  • We use glom() to make each partition into a tuple.
  • We use len on each partition to get the length of the tuple - size of the partition.
  • We collect the results to print them out.
print(C.glom().map(len).collect())
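
For a more concrete view of what glom() returns, here is a tiny example on a small RDD (how the elements are split across partitions is up to Spark, so the exact output may differ):

small = sc.parallelize(range(10), 3)    # a 10-element RDD in 3 partitions
print(small.glom().collect())           # e.g. [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
print(small.glom().map(len).collect())  # e.g. [3, 3, 4]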

A more elaborate example

There are many things that you can do using glom()

Below is an example, can you figure out what it does?

def getPartitionInfo(G):
    d=0
    if len(G)>1: 
        for i in range(len(G)-1):
            d+=abs(G[i+1][1]-G[i][1]) # access the glommed partition, which is now a list
        return (G[0][0],len(G),d)
    else:
        return None

output=B.glom().map(getPartitionInfo).collect()
print(output)

Summary

  • We learned why partitions are important and how to control them.
  • We learned how glom() can be used to allow workers to access their partitions as lists.

 

 
