# Output of wholeTextFiles().collect() : a list of (filePath, fileContent) pairs.
[('file:/home/wjh/PYspark/pyspark-recipes/code_mishra/chapter6/dataFiles/shakespearePlays.txt',
"Love's Labour's Lost \nA Midsummer Night's Dream\nMuch Ado About Nothing\nAs You Like It\n")]
# The file contents alone, without the path keys.
["Love's Labour's Lost \nA Midsummer Night's Dream\nMuch Ado About Nothing\nAs You Like It\n"]
Saving RDD as a Text File
data = spark.sparkContext.textFile('./chapter6/dataFiles/shakespearePlays.txt', 4)
dataLineLength = data.map(lambda x: len(x))
dataLineLength.saveAsTextFile('./chapter6/dataFiles/save_rdd')
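The `map(lambda x: len(x))` step computes the length of each line. As a sanity check that needs no Spark at all, here is the same transformation in plain Python over the four play titles taken from the `collect()` output above:

```python
# The per-line transformation performed by data.map(lambda x: len(x)),
# reproduced in plain Python on the lines of shakespearePlays.txt.
lines = [
    "Love's Labour's Lost ",
    "A Midsummer Night's Dream",
    "Much Ado About Nothing",
    "As You Like It",
]
line_lengths = [len(line) for line in lines]
```

This is exactly what `dataLineLength` holds, one integer per input line, spread across the four partitions.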
Reading a Directory
# Reading a directory using textFile() function.
>>> manyFilePlayData = sc.textFile('/home/pysparkbook/pysparkBookData/manyFiles',4)
>>> manyFilePlayData.collect()
# Reading a directory using wholeTextFiles() function.
>>> manyFilePlayDataKeyValue = sc.wholeTextFiles('/home/pysparkbook/pysparkBookData/manyFiles',4)
>>> manyFilePlayDataKeyValue.collect()
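To make the difference concrete without a Spark cluster, the following plain-Python sketch mimics what `wholeTextFiles()` returns for a directory: a list of (filePath, fileContent) pairs, one per file. The directory and file names here are made up for illustration.

```python
import os
import tempfile

# Create a throwaway directory with two small files (hypothetical names).
tmpDir = tempfile.mkdtemp()
for name, text in [('play1.txt', 'Hamlet\n'), ('play2.txt', 'Macbeth\n')]:
    with open(os.path.join(tmpDir, name), 'w') as f:
        f.write(text)

# textFile() would merge the lines of every file into one RDD of strings;
# wholeTextFiles() instead pairs each path with that file's full content.
pairs = sorted(
    (os.path.join(tmpDir, name), open(os.path.join(tmpDir, name)).read())
    for name in os.listdir(tmpDir)
)
```

The pair form is what makes `wholeTextFiles()` the right choice when you need to know which file each record came from.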
Reading Data from HDFS
# Reading Data from HDFS.
>>> filamentdata = sc.textFile('hdfs://localhost:9746/bookData/filamentData.csv',4)
Saving Data to HDFS
>>> playData = sc.textFile('/home/muser/bData/shakespearePlays.txt',4)
>>> playDataLineLength = playData.map(lambda x: len(x))
>>> playDataLineLength.collect()
# Saving RDD to HDFS.
>>> playDataLineLength.saveAsTextFile('hdfs://localhost:9746/savedData/')
### hadoop fs -cat /savedData/part-00000
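`saveAsTextFile()` writes one `part-0000N` file per partition of the RDD, which is why the result is inspected with `hadoop fs -cat /savedData/part-00000`. The plain-Python sketch below imitates that layout; the two-partition split of the line lengths is an assumption for illustration.

```python
import os
import tempfile

lengths = [21, 25, 22, 14]               # the playDataLineLength values
partitions = [lengths[:2], lengths[2:]]  # pretend the RDD has 2 partitions

# Write one part file per "partition", as saveAsTextFile() would.
outDir = tempfile.mkdtemp()
for i, part in enumerate(partitions):
    with open(os.path.join(outDir, 'part-%05d' % i), 'w') as f:
        f.writelines('%d\n' % n for n in part)

partFiles = sorted(os.listdir(outDir))
with open(os.path.join(outDir, 'part-00000')) as f:
    firstPart = f.read()
```

Each part file holds only its partition's records, so reading a single part file gives a partial view of the RDD.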
Reading Data from a Sequential File
What is a sequential file?
A sequential file is one whose contents are stored in some order; it must always be read starting from the beginning of the file. This is in contrast to a direct-access file, whose records can be retrieved without reading the entire file.
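The distinction can be illustrated with an ordinary byte stream: a sequential reader scans from the start, while direct access jumps straight to a known byte offset. The 8-byte records below are a toy example.

```python
import io

# Three fixed-width records ("recordN\n" is 8 bytes each).
buf = io.BytesIO(b"record1\nrecord2\nrecord3\n")

# Sequential access: read everything in order from the beginning.
sequential = buf.read().splitlines()

# Direct access: seek straight to byte offset 8, the start of record2.
buf.seek(8)
direct = buf.readline()
```

Hadoop sequence files are read sequentially in this sense, record after record, which is what makes them a natural fit for streaming through an RDD.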
sequenceFile():
path : the path of the sequence file
keyClass : indicates the key class of data in the sequence file
valueClass : indicates the value class of data in the sequence file
import csv
import io

def parseCSV(csvRow):
    data = io.StringIO(csvRow)
    dataReader = csv.reader(data)
    return [x for x in dataReader]

csvRow = "p,s,r,p"
parseCSV(csvRow)   # returns [['p', 's', 'r', 'p']]
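The `csv` module earns its keep once fields contain embedded commas, which a naive `split(',')` would cut apart. A self-contained sketch of the same parsing idea on a small multi-line CSV (the sample data is made up):

```python
import csv
import io

# A small CSV with a quoted field that contains a comma.
csvText = 'name,qty\n"Smith, John",3\nJones,5\n'
rows = [row for row in csv.reader(io.StringIO(csvText))]

# A naive split wrongly cuts the quoted field in two.
naive = csvText.splitlines()[1].split(',')
```

`csv.reader` keeps `"Smith, John"` as one field, while the naive split produces `'"Smith'` and `' John"'` as separate pieces.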