Spark - Loading and Saving Data

- File Formats

  • Text File

sc.textFile loads a text file as an RDD with one element per line.

sc.wholeTextFiles loads every file under the specified directory as (filename, entire content) pairs.
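A runnable sketch of the difference between the two calls, assuming a local SparkContext and sample files created on the fly (the paths, file names, and data are illustrative, not from the original):

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object TextFileExample {
  def run(): (Long, Long) = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("text"))
    val dir = Files.createTempDirectory("txt")
    Files.write(dir.resolve("a.txt"), "line1\nline2\n".getBytes)
    Files.write(dir.resolve("b.txt"), "line3\n".getBytes)

    val lines = sc.textFile(dir.toString)          // one element per line, across all files
    val files = sc.wholeTextFiles(dir.toString)    // one (filename, content) pair per file
    val counts = (lines.count(), files.count())    // 3 lines total, 2 files
    sc.stop()
    counts
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Because wholeTextFiles keeps each file as a single element, it suits formats where a record spans the whole file; textFile suits line-oriented records.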

  • JSON
Read with sc.textFile, then map each line to a JSON object using a third-party library such as Jackson (e.g. people.add(mapper.readValue(line, Person.class))).
  • CSV
Read with sc.textFile, then map each line to an array of fields using a third-party library such as opencsv (e.g. val reader = new CSVReader(new StringReader(line)); reader.readNext()).
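opencsv handles quoting and embedded commas; for simple files without quoted fields, a plain split shows the shape of the per-line transformation. A minimal pure-Scala sketch (the sample data is an assumption, and the Seq stands in for sc.textFile):

```scala
object CsvSplitExample {
  // Naive CSV parsing: fine only when fields contain no quotes or
  // embedded commas. Real pipelines should use a library like opencsv.
  def parseLine(line: String): Array[String] = line.split(",").map(_.trim)

  def main(args: Array[String]): Unit = {
    val lines = Seq("alice, 30", "bob, 25")       // stand-in for sc.textFile(...)
    val rows  = lines.map(parseLine)              // in Spark: rdd.map(parseLine)
    rows.foreach(r => println(r.mkString("|")))
  }
}
```

In a Spark job the same parseLine would be passed to rdd.map, keeping the parsing logic testable outside the cluster.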
  • Sequence File
sc.sequenceFile(inFile, classOf[Text], classOf[IntWritable]).
map{case (x, y) => (x.toString, y.get())}
  • Object File 
Uses Java serialization (saveAsObjectFile / objectFile); mainly used for communicating RDDs between Spark jobs.
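A hedged round-trip sketch of the two calls, assuming a local SparkContext and a temp directory (the sample data is illustrative):

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object ObjectFileExample {
  def roundTrip(): Map[String, Int] = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("obj"))
    val out = Files.createTempDirectory("obj").resolve("data").toString

    // Write: elements go through Java serialization, so they must implement
    // java.io.Serializable (tuples and case classes already do).
    sc.parallelize(Seq(("alice", 30), ("bob", 25))).saveAsObjectFile(out)

    // Read back, typically in a different Spark job; the element type
    // is supplied explicitly because the file carries no schema.
    val restored = sc.objectFile[(String, Int)](out).collect().toMap
    sc.stop()
    restored
  }

  def main(args: Array[String]): Unit = println(roundTrip())
}
```

Java serialization ties the saved bytes to the classes' serialized form, so object files are best kept to short-lived job-to-job handoffs rather than long-term storage.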


- Hadoop InputFormat and OutputFormat

// Hadoop old API
val input = sc.hadoopFile[Text, Text, KeyValueTextInputFormat](inputFile).map {
    case (x, y) => (x.toString, y.toString)
}

// Hadoop new API
val input = sc.newAPIHadoopFile(inputFile, classOf[LzoJsonInputFormat],
    classOf[LongWritable], classOf[MapWritable], conf)


- Others

hadoopDataset/saveAsHadoopDataset (and the newAPIHadoopDataset variants) access Hadoop-supported storage systems that are not filesystems, such as HBase, through a JobConf.
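A minimal sketch of saveAsHadoopDataset, assuming a local SparkContext and using TextOutputFormat on a temp directory so it can actually run; for a non-filesystem store such as HBase, only the JobConf settings would change (e.g. TableOutputFormat), not the save call:

```scala
import java.nio.file.Files
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextOutputFormat}
import org.apache.spark.{SparkConf, SparkContext}

object HadoopDatasetExample {
  def run(): Boolean = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("ds"))
    val out = Files.createTempDirectory("ds").resolve("data").toString

    // The JobConf carries the OutputFormat, key/value classes, and target.
    val conf = new JobConf()
    conf.setOutputFormat(classOf[TextOutputFormat[Text, Text]])
    conf.setOutputKeyClass(classOf[Text])
    conf.setOutputValueClass(classOf[Text])
    FileOutputFormat.setOutputPath(conf, new Path(out))

    sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
      .map { case (k, v) => (new Text(k), new Text(v)) }
      .saveAsHadoopDataset(conf)
    sc.stop()

    // The old-API OutputFormat writes part-NNNNN files under the target dir.
    new java.io.File(out).listFiles.exists(_.getName.startsWith("part-"))
  }

  def main(args: Array[String]): Unit = println(run())
}
```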


