Spark Notes: Word Count

Word Count

Counting the number of occurrences of words in a text is a popular first exercise using map-reduce.

The Task

Input: A text file consisting of words separated by spaces.
Output: A list of words and their counts, sorted from the most to the least common.

We will use the book "Moby Dick" as our input.

#start the SparkContext
from pyspark import SparkContext
sc=SparkContext(master="local[4]")

# set import path
import sys
sys.path.append('../../Utils/')
#from plan2table import plan2table
def pretty_print_plan(rdd):
    for x in rdd.toDebugString().decode().split('\n'):
        print(x)

Download data file from S3

## If this cell fails, download the file from https://mas-dse-open.s3.amazonaws.com/Moby-Dick.txt in your browser
# and put it in the '../../Data/' directory
import requests
data_dir='../../Data'
filename='Moby-Dick.txt'
url = "https://mas-dse-open.s3.amazonaws.com/"+filename
local_path = data_dir+'/'+filename
!mkdir -p {data_dir}
# Copy URL content to local_path
r = requests.get(url, allow_redirects=True)
open(local_path, 'wb').write(r.content)

# check that the text file is where we expect it to be
!ls -l $local_path

Define an RDD that will read the file

  • Execution of the read is lazy.
  • The file has been opened,
  • but reading only starts when the stage is executed.
text_file = sc.textFile(data_dir+'/'+filename)
type(text_file) 

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.6 s

Steps for counting the words

  • Split each line by spaces.
  • Filter out empty strings.
  • Map each word to (word, 1).
  • Count the number of occurrences of each word.
words =     text_file.flatMap(lambda line: line.split(" "))
not_empty = words.filter(lambda x: x!='') 
key_values= not_empty.map(lambda word: (word, 1)) 
counts=     key_values.reduceByKey(lambda a, b: a + b)

CPU times: user 20 ms, sys: 10 ms, total: 30 ms
Wall time: 318 ms

flatMap()

Note the line:

words =     text_file.flatMap(lambda line: line.split(" "))

Why are we using flatMap, rather than map?

The reason is that the operation line.split(" ") generates a list of strings, so had we used map, the result would be an RDD of lists of words, not an RDD of words.

The difference is that flatMap expects the mapped function to return a list (or other iterable) for each element, and it concatenates those lists to form the resulting RDD.
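
As a small illustration (added here, not part of the original notebook), here is what the two transformations produce on a tiny hand-made RDD:

# map keeps one output element per input element, so each line becomes a list of words
lines = sc.parallelize(["call me Ishmael", "some years ago"])
print(lines.map(lambda line: line.split(" ")).collect())
# [['call', 'me', 'Ishmael'], ['some', 'years', 'ago']]

# flatMap concatenates the per-line lists, so the result is a single RDD of words
print(lines.flatMap(lambda line: line.split(" ")).collect())
# ['call', 'me', 'Ishmael', 'some', 'years', 'ago']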

The execution plan

In the last cell we defined the execution plan, but we have not started to execute it.

  • Preparing the plan took ~100ms, which is a non-trivial amount of time,
  • but much less than the time it will take to execute it.
  • Let's have a look at the execution plan.

Understanding the details

To see which step in the plan corresponds to which RDD, we print out the execution plan for each of the RDDs.

Note that the execution plans for words, not_empty and key_values are all the same.
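
For example, the plans can be printed with the pretty_print_plan helper defined above (a sketch added for completeness; the exact output depends on your Spark version):

print('words:')
pretty_print_plan(words)
print('\nnot_empty:')
pretty_print_plan(not_empty)
print('\nkey_values:')
pretty_print_plan(key_values)
print('\ncounts:')
pretty_print_plan(counts)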

Execution

Finally we count the number of times each word has occurred. Now the lazy execution model finally performs some actual work, which takes a significant amount of time.

## Run #1
Count=counts.count()  # Count = the number of different words
Sum=counts.map(lambda x:x[1]).reduce(lambda x,y:x+y) # Sum = the total number of words
print('Different words=%5.0f, total words=%6.0f, mean no. occurances per word=%4.2f'%(Count,Sum,float(Sum)/Count))

Different words=33781, total words=215133, mean no. occurances per word=6.37
CPU times: user 10.5 ms, sys: 7.39 ms, total: 17.9 ms
Wall time: 1.04 s

Amortization

When the same commands are performed repeatedly on the same data, the execution time tends to decrease in later executions.

The cells below are identical to the one above, with one exception at Run #3.

Observe that Run #2 takes much less time than Run #1, even though no cache() was explicitly requested. The reason is that Spark caches (or materializes) key_values before executing reduceByKey(), because performing reduceByKey requires a shuffle, and a shuffle requires that its input RDD is materialized. In other words, sometimes caching happens even if the programmer did not ask for it.
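
To check what the programmer explicitly requested, an RDD exposes its caching state (a small sketch added here; it reflects only explicit cache()/persist() calls, not the shuffle files Spark keeps around):

# is_cached and getStorageLevel() report only explicit caching requests
print(counts.is_cached)           # False until counts.cache() is called in Run #3
print(counts.getStorageLevel())   # e.g. 'Serialized 1x Replicated' before caching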

## Run #2
Count=counts.count()
Sum=counts.map(lambda x:x[1]).reduce(lambda x,y:x+y)
print('Different words=%5.0f, total words=%6.0f, mean no. occurances per word=%4.2f'%(Count,Sum,float(Sum)/Count))

Different words=33781, total words=215133, mean no. occurances per word=6.37
CPU times: user 20 ms, sys: 10 ms, total: 30 ms
Wall time: 3.9 s

Explicit Caching

In Run #3 we explicitly ask for counts to be cached. This will reduce the execution time in the following run by a little bit, but not by much.

## Run #3, cache
Count=counts.cache().count()
Sum=counts.map(lambda x:x[1]).reduce(lambda x,y:x+y)
print('Different words=%5.0f, total words=%6.0f, mean no. occurances per word=%4.2f'%(Count,Sum,float(Sum)/Count))
Different words=33781, total words=215133, mean no. occurances per word=6.37
CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 597 ms
#Run #4
Count=counts.count()
Sum=counts.map(lambda x:x[1]).reduce(lambda x,y:x+y)
print('Different words=%5.0f, total words=%6.0f, mean no. occurances per word=%4.2f'%(Count,Sum,float(Sum)/Count))

Different words=33781, total words=215133, mean no. occurances per word=6.37
CPU times: user 20 ms, sys: 10 ms, total: 30 ms
Wall time: 432 ms

#Run #5
Count=counts.count()
Sum=counts.map(lambda x:x[1]).reduce(lambda x,y:x+y)
print('Different words=%5.0f, total words=%6.0f, mean no. occurances per word=%4.2f'%(Count,Sum,float(Sum)/Count))

Different words=33781, total words=215133, mean no. occurances per word=6.37
CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 307 ms

Summary

This was our first real pyspark program, hurray!

Some things you learned:
1) An RDD is a distributed immutable array.
It is the core data structure of Spark.
2) You cannot operate on an RDD directly. Only through Transformations and Actions.
3) Transformations transform an RDD into another RDD.
4) Actions output their results on the head node.
5) After the action is done, you are using just the head node, not the workers.

Lazy Execution:
1) RDD operations are added to an Execution Plan.
2) The plan is executed only when a result is needed, as the short sketch below illustrates.
3) Explicit and implicit caching cause intermediate results to be saved.
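
A minimal sketch of points 1 and 2 (added here as an illustration, with made-up data):

# A transformation only extends the execution plan; it returns immediately
squares = sc.parallelize(range(1000000)).map(lambda x: x * x)   # no work done yet

# An action forces the plan to run and brings the result back to the head node
print(squares.take(3))   # [0, 1, 4] -- execution happens here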

 

Finding the most common words

  • counts: RDD with 33781 pairs of the form (word,count)
  • Find the 5 most frequent words. 
  • Method1: collect and sort on head node.
  • Method2: Pure Spark, collect only at the end.

Method1: collect and sort on head node

Collect the RDD into the driver node

  • Collect can take significant time.
C=counts.collect()

CPU times: user 50 ms, sys: 0 ns, total: 50 ms
Wall time: 200 ms

Sort

  • The RDD has been collected into a list on the driver node.
  • We are no longer using Spark parallelism.
  • Sort in Python.
  • Will not scale to very large documents.
C.sort(key=lambda x:x[1])
print('most common words\n'+'\n'.join(['%s:\t%d'%c for c in reversed(C[-5:])]))

most common words
the:    13766
of:    6587
and:    5951
a:    4533
to:    4510
CPU times: user 10 ms, sys: 0 ns, total: 10 ms
Wall time: 18.1 ms

Compute the mean number of occurrences per word.

Count2=len(C)
Sum2=sum([i for w,i in C])
print('count2=%f, sum2=%f, mean2=%f'%(Count2,Sum2,float(Sum2)/Count2))

count2=33781.000000, sum2=215133.000000, mean2=6.368462

Method2: Pure Spark, collect only at the end.

  • Collect into the head node only the most frequent words.
  • Requires multiple stages.

Step 1 split, clean and map to (word,1)

word_pairs=text_file.flatMap(lambda x: x.split(' '))\
    .filter(lambda x: x!='')\
    .map(lambda word: (word,1))

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 66.3 µs

Step 2 Count occurrences of each word.

counts=word_pairs.reduceByKey(lambda x,y:x+y)

CPU times: user 10 ms, sys: 0 ns, total: 10 ms
Wall time: 37.5 ms

Step 3 Reverse (word,count) to (count,word) and sort by key

reverse_counts=counts.map(lambda x:(x[1],x[0]))   # reverse order of word and count
sorted_counts=reverse_counts.sortByKey(ascending=False)

CPU times: user 30 ms, sys: 10 ms, total: 40 ms
Wall time: 1.49 s

Full execution plan

We now have a complete plan to compute the most common words in the text. Nothing has been executed yet! Not even a single byte has been read from the file Moby-Dick.txt!

For more on execution plans and lineage see Jacek Laskowski's blog.

print('word_pairs:')
pretty_print_plan(word_pairs)
print('\ncounts:')
pretty_print_plan(counts)
print('\nreverse_counts:')
pretty_print_plan(reverse_counts)
print('\nsorted_counts:')
pretty_print_plan(sorted_counts)

Step 4 Take the top 5 words:

D=sorted_counts.take(5)
print('most common words\n'+'\n'.join(['%d:\t%s'%c for c in D]))

most common words
13766:    the
6587:    of
5951:    and
4533:    a
4510:    to
CPU times: user 10 ms, sys: 0 ns, total: 10 ms
Wall time: 398 ms
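
As a side note (not part of the original notebook), steps 3 and 4 can also be replaced by a single call to takeOrdered, which avoids fully sorting the RDD:

# takeOrdered keeps only the requested number of elements, ordered by the key function
top5 = counts.takeOrdered(5, key=lambda x: -x[1])
print(top5)   # e.g. [('the', 13766), ('of', 6587), ...]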

Summary

We showed two ways for finding the most common words:

  1. Collecting and sorting at the head node -- does not scale.
  2. Using RDDs to the end.

 
