PySpark-Recipes : I/O operations (txt, json, hdfs, csv...)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('IO').getOrCreate()

read txt

  • textFile : returns only the text content
  • wholeTextFiles : returns (file path, text content) pairs
  • Three parameters common to both:
    • path
    • minPartitions
    • use_unicode : default True
data = spark.sparkContext.textFile('./chapter6/dataFiles/shakespearePlays.txt', 2)

data_list = data.collect()
data_list
["Love's Labour's Lost     ",
 "A Midsummer Night's Dream",
 'Much Ado About Nothing',
 'As You Like It']
data.count()
4
# total character length across all lines
char_length = data.map(lambda x: len(x))

char_length.sum()
86
whole_data = spark.sparkContext.wholeTextFiles('./chapter6/dataFiles/shakespearePlays.txt',2)

whole_data_list = whole_data.collect()
whole_data_list
[('file:/home/wjh/PYspark/pyspark-recipes/code_mishra/chapter6/dataFiles/shakespearePlays.txt',
  "Love's Labour's Lost     \nA Midsummer Night's Dream\nMuch Ado About Nothing\nAs You Like It\n")]
whole_data.keys().collect()
['file:/home/wjh/PYspark/pyspark-recipes/code_mishra/chapter6/dataFiles/shakespearePlays.txt']
whole_data.values().collect()
["Love's Labour's Lost     \nA Midsummer Night's Dream\nMuch Ado About Nothing\nAs You Like It\n"]

write txt

data = spark.sparkContext.textFile('./chapter6/dataFiles/shakespearePlays.txt',4)
dataLineLength = data.map(lambda x : len(x))

dataLineLength.saveAsTextFile('./chapter6/dataFiles/save_rdd')
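
saveAsTextFile() writes a directory of part files (one per partition) rather than a single file. As a quick check, here is a sketch that reads the just-written save_rdd directory back with textFile():

# Read the saved directory back; each element is one line from a part file.
saved_data = spark.sparkContext.textFile('./chapter6/dataFiles/save_rdd')
saved_data.collect()
# expected: the four line lengths as strings, e.g. ['25', '25', '22', '14'] (order may vary)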

read directory

# Reading a directory using textFile() function.

>>> manyFilePlayData = sc.textFile('/home/pysparkbook/pysparkBookData/manyFiles',4)

>>> manyFilePlayData.collect()

# Reading a directory using wholeTextFiles() function.

>>> manyFilePlayDataKeyValue = sc.wholeTextFiles('/home/pysparkbook/pysparkBookData/manyFiles',4)
>>> manyFilePlayDataKeyValue.collect()

read hdfs

# Reading  Data from HDFS.
 

>>> filamentdata = sc.textFile('hdfs://localhost:9746/bookData/filamentData.csv',4)

save hdfs

>>> playData = sc.textFile('/home/muser/bData/shakespearePlays.txt',4)
>>> playDataLineLength = playData.map(lambda x : len(x))
>>> playDataLineLength.collect()

#  Saving RDD to HDFS.

>>> playDataLineLength.saveAsTextFile('hdfs://localhost:9746/savedData/')


###  hadoop fs -cat /savedData/part-00000

Reading Data from Sequential File

What is a sequential file?
A sequential file is one whose contents are stored in some order. It must always be read starting from the beginning of the file, as opposed to a direct-access file, whose contents can be retrieved without reading the entire file.
  • sequenceFile():
    • path
    • keyClass : the key class of the data in the sequence file
    • valueClass : the datatype of the values
>>> simpleRDD = sc.sequenceFile('hdfs://localhost:9746/sequenceFileToRead')
>>> simpleRDD.collect()
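
The keyClass and valueClass arguments take fully qualified Hadoop Writable class names. A minimal sketch of passing them explicitly, assuming the sequence file was written with Text keys and Text values:

# Reading a sequence file with explicit key and value classes (assumed here to be Text).

>>> simpleRDD = sc.sequenceFile('hdfs://localhost:9746/sequenceFileToRead',
...                             keyClass = 'org.apache.hadoop.io.Text',
...                             valueClass = 'org.apache.hadoop.io.Text')
>>> simpleRDD.collect()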

Write Data to a Sequential File

subjectsData = [('si1','Python'),
        ('si3','Java'),
        ('si1','Java'),
        ('si2','Python'),
        ('si3','Ruby'),
        ('si4','C++'),
        ('si5','C'),
        ('si4','Python'),
        ('si2','Java')]


RDD = sc.parallelize(subjectsData, 4)

RDD.saveAsSequenceFile('hdfs://localhost:9746/sequenceFiles')

read csv file

import csv
import io

# Parse a CSV row (a string) into a list of field lists using the csv module.
def parseCSV(csvRow):
    data = io.StringIO(csvRow)
    dataReader = csv.reader(data)
    return [x for x in dataReader]

csvRow = "p,s,r,p"
parseCSV(csvRow)
[['p', 's', 'r', 'p']]
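
parseCSV() returns a list of rows, so when it is applied to a file read with textFile(), flatMap() flattens the per-line lists into one record per row. A minimal sketch, assuming a local copy of filamentData.csv (the path below is illustrative):

# Read the CSV file line by line and parse each line; flatMap() flattens
# the one-element lists returned by parseCSV() into individual rows.
csvRDD = spark.sparkContext.textFile('./chapter6/dataFiles/filamentData.csv', 4)
parsedCSVRDD = csvRDD.flatMap(parseCSV)
parsedCSVRDD.take(4)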

read json file

import json

def jsonParse(dataLine):
    parsedDict = json.loads(dataLine)
    valueData = parsedDict.values()
    return(valueData)

jsonData = '{"Time":"6AM",  "Temperature":15}'
jsonParsedData = jsonParse(jsonData)
print(jsonParsedData)
dict_values(['6AM', 15])
>>> tempData = sc.textFile("/home/pysparkbook//pysparkBookData/tempData.json",4)

>>> tempData.take(4)

# Creating paired RDD.
>>> tempDataParsed = tempData.map(jsonParse)
# write json format
def createJSON(data):
    dataDict = {}
    dataDict['Name'] = data[0]
    dataDict['Age'] = data[1]
    return(json.dumps(dataDict))

nameAgeList = ['Arun',22]

createJSON(nameAgeList)
'{"Name": "Arun", "Age": 22}'
>>> nameAgeData = [['Arun',22],
...                ['Bony',35],
...                ['Juna',29]]
>>> nameAgeRDD = sc.parallelize(nameAgeData,3)

>>> nameAgeRDD.collect()

>>> nameAgeJSON = nameAgeRDD.map(createJSON)
>>> nameAgeJSON.collect()
>>> nameAgeJSON.saveAsTextFile('/home/pysparkbook/jsonDir/')
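
saveAsTextFile() stores the JSON strings as a directory of part files; as a round-trip check (a sketch using the jsonParse() function defined above), the directory can be read back and parsed:

# Reading back the saved JSON directory and parsing each line.

>>> savedJSON = sc.textFile('/home/pysparkbook/jsonDir/')
>>> savedJSON.map(jsonParse).collect()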

Reading table data from HBase using PySpark

>>> hostName = 'localhost'

>>> tableName = 'pysparkBookTable'

>>> ourInputFormatClass='org.apache.hadoop.hbase.mapreduce.TableInputFormat'
>>> ourKeyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable'
>>> ourValueClass='org.apache.hadoop.hbase.client.Result'
>>> ourKeyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter'
>>> ourValueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter'
>>> configuration = {}
>>> configuration['hbase.mapreduce.inputtable'] = tableName
>>> configuration['hbase.zookeeper.quorum'] = hostName

Now it is time to call the function newAPIHadoopRDD() with its arguments. 

>>> tableRDDfromHBase = sc.newAPIHadoopRDD(
...                        inputFormatClass = ourInputFormatClass,
...                        keyClass = ourKeyClass,
...                        valueClass = ourValueClass,
...                        keyConverter = ourKeyConverter,
...                        valueConverter = ourValueConverter,
...                        conf = configuration
...                     )


Let us see what our paired RDD tableRDDfromHBase looks like.

>>> tableRDDfromHBase.take(2)