Spark Study Notes (5)

Original post

2020-02-22 04:48

MLlib for Spark

K-means
1. K-means (Scala)

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data (one space-separated point per line).
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
// Cluster the data into five classes using KMeans (at most 20 iterations).
val clusters = KMeans.train(parsedData, 5, 20)

// Compute the sum of squared errors.
val cost = clusters.computeCost(parsedData)
println("Sum of squared errors = " + cost)

2. K-means (Python)

from numpy import array
from pyspark.mllib.clustering import KMeans

# Load and parse the data
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(lambda line:
    array([float(x) for x in line.split(' ')])).cache()
# Build the model (cluster the data).
# runs=1 was the default and is omitted; recent PySpark releases dropped the parameter.
clusters = KMeans.train(parsedData, 5, maxIterations=20, initializationMode="k-means||")

# Evaluate clustering by computing the sum of squared errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    # Squared distance to the assigned center (no square root),
    # so the total really is a sum of squared errors.
    return sum([x**2 for x in (point - center)])
cost = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Sum of squared errors = " + str(cost))
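To see what the evaluation step computes without a cluster, here is a plain-Python sketch of the same cost: each point is assigned to its nearest center and the squared distances are summed. The points and centers are made-up values for illustration, not the contents of kmeans_data.txt, and the helper names are hypothetical.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

def nearest_center(point, centers):
    """Index of the closest center (the role predict() plays in the notes)."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def sum_of_squared_errors(points, centers):
    """Sum over all points of the squared distance to the assigned center."""
    return sum(squared_distance(p, centers[nearest_center(p, centers)]) for p in points)

# Hypothetical toy data: two tight groups and two assumed centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(0.0, 1.0), (10.0, 11.0)]
cost = sum_of_squared_errors(points, centers)
print("Sum of squared errors =", cost)
```

Each toy point sits exactly one unit from its center, so every term contributes 1 to the total.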

Dimensionality reduction + K-means

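import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// compute principal components
val points: RDD[Vector] = ...
val mat = new RowMatrix(points)
val pc = mat.computePrincipalComponents(20)
// project points to a low-dimensional space
val projected = mat.multiply(pc).rows
// train a k-means model on the projected data
// (maxIterations = 20 is an assumed value; the original call omitted it)
val model = KMeans.train(projected, 10, 20)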

Streaming + MLlib

// collect tweets using streaming
// train a k-means model
val model: KMeansModel = ...
// apply model to filter tweets
val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0)))
val statuses = tweets.map(_.getText)
val filteredTweets =
  statuses.filter(t => model.predict(featurize(t)) == clusterNumber)
// print tweets within this particular cluster
filteredTweets.print()

Collaborative filtering
Goal: recover a matrix from a subset of its entries. (To revisit.)

Collaborative filtering
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val numIterations = 20
val rank = 10
val regularizer = 0.01
val model = ALS.train(ratings, rank, numIterations, regularizer)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions = model.predict(usersProducts)
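The goal stated above, recovering a matrix from a subset of its entries, can be illustrated with a tiny rank-1 alternating least squares loop in plain Python. This is only a sketch of the idea under simplifying assumptions (rank 1, plain L2 regularization, a made-up 3x3 rating matrix), not MLlib's ALS implementation. The function name `als_rank1` is hypothetical.

```python
def als_rank1(ratings, num_users, num_items, iterations=100, lam=0.01):
    """Alternating least squares for a rank-1 model: rating ~ p[user] * q[item].

    ratings: list of (user, item, rating) triples for the observed entries.
    lam: L2 regularization added to each least-squares denominator.
    """
    p = [1.0] * num_users
    q = [1.0] * num_items
    for _ in range(iterations):
        # Fix the item factors and solve each user's 1-D least squares.
        for u in range(num_users):
            seen = [(item, r) for (user, item, r) in ratings if user == u]
            p[u] = sum(r * q[i] for i, r in seen) / (sum(q[i] ** 2 for i, r in seen) + lam)
        # Fix the user factors and solve each item's 1-D least squares.
        for i in range(num_items):
            seen = [(user, r) for (user, item, r) in ratings if item == i]
            q[i] = sum(r * p[u] for u, r in seen) / (sum(p[u] ** 2 for u, r in seen) + lam)
    return p, q

# Hypothetical data: the outer product of [1, 2, 3] and [2, 1, 4],
# with entry (2, 2) = 12 deliberately left unobserved.
observed = [(0, 0, 2.0), (0, 1, 1.0), (0, 2, 4.0),
            (1, 0, 4.0), (1, 1, 2.0), (1, 2, 8.0),
            (2, 0, 6.0), (2, 1, 3.0)]
p, q = als_rank1(observed, num_users=3, num_items=3)
predicted = p[2] * q[2]
print("Predicted rating for the missing entry:", predicted)
```

Because the toy matrix really is rank 1, the prediction for the hidden entry should land close to 12. The regularization term pulls it slightly below that value.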

Spark Streaming

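What: an extension of Spark for processing large data streams

Why

Many big data applications need to process large data streams in real time: website monitoring, fraud detection, ad monetization
Advantages
Scales to hundreds of nodes
Achieves low latency
Recovers efficiently from failures
Integrates with batch and interactive processing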

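Existing streaming systems
Storm

Replays a record if a node did not process it
Processes each record at least once
May update mutable state twice!
Mutable state can be lost due to failure!

Trident

Uses transactions to update state
Processes each record exactly once
Per-state transactions to an external database are slow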
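The difference between the two guarantees can be sketched in plain Python. The first counter updates mutable state on every delivery, so a replayed record is counted twice. The second guards each update with the record's ID, which is a simplified stand-in for a transactional update. Neither function uses Storm's or Trident's actual API. The delivery log is invented for illustration.

```python
def count_at_least_once(deliveries):
    """Mutable counter updated on every delivery, replays included."""
    counts = {}
    for record_id, word in deliveries:
        counts[word] = counts.get(word, 0) + 1
    return counts

def count_exactly_once(deliveries):
    """Same counter, but each update is guarded by the record's ID."""
    counts = {}
    applied = set()  # IDs whose update has already been committed
    for record_id, word in deliveries:
        if record_id in applied:
            continue  # replayed record: the state already reflects it
        counts[word] = counts.get(word, 0) + 1
        applied.add(record_id)
    return counts

# Hypothetical delivery log: record 2 is replayed after a timeout.
deliveries = [(1, "spark"), (2, "storm"), (2, "storm"), (3, "spark")]
print(count_at_least_once(deliveries))
print(count_exactly_once(deliveries))
```

The guarded version pays for its correctness with an extra lookup and write per record, which hints at why per-state transactions against an external database are slow.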

Spark Streaming
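Runs a streaming computation as a series of very small, deterministic batch jobs

Chops the live stream into batches of X seconds
Spark treats each batch of data as an RDD and processes it with RDD operations
Finally, the processed results of the RDD operations are returned in batches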
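The three steps above can be mimicked in plain Python: slice timestamped events into fixed-width batches, run the same deterministic job on each batch, and return the results batch by batch. This is a sketch of the micro-batch idea only. The event list and helper names are hypothetical, and unlike Spark Streaming it simply skips intervals that contain no events.

```python
from collections import defaultdict

def slice_into_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-width batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    return [batches[k] for k in sorted(batches)]

def process_batch(batch):
    """A deterministic per-batch job: here, a word count."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Hypothetical stream of (seconds since start, word) events.
events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (1.5, "a"), (2.7, "c")]
results = [process_batch(b) for b in slice_into_batches(events, batch_seconds=1)]
print(results)
```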

Programming model - DStream
Discretized stream (DStream)
- Represents a stream of data
- Implemented as a sequence of RDDs

Example - getting hashtags from Twitter

val ssc = new StreamingContext(sparkContext, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)

Input DStream: tweets

val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))

Transformed DStream: hashTags
Transformation: flatMap
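Since a DStream is a sequence of RDDs, a transformation such as flatMap amounts to applying the same function to every batch. The plain-Python sketch below models a stream as a list of batches. `get_tags` is a hypothetical stand-in for the `getTags` helper (the notes do not show its body), and the tweets are invented.

```python
def get_tags(status):
    """Hypothetical stand-in for getTags: words that start with '#'."""
    return [w for w in status.split() if w.startswith("#") and len(w) > 1]

def flat_map(dstream, f):
    """Apply f to every element of every batch and flatten the results per batch."""
    return [[out for item in batch for out in f(item)] for batch in dstream]

# Two batches of made-up tweet texts.
tweets = [
    ["learning #spark today", "no tags here"],
    ["#spark #streaming rocks"],
]
hash_tags = flat_map(tweets, get_tags)
print(hash_tags)
```

The batch boundaries survive the transformation: the output has one list of tags per input batch.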

hashTags.saveAsHadoopFiles("hdfs://…")

Output operation: saveAsHadoopFiles

hashTags.foreachRDD(hashTagRDD => { … })

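1. Specify a function that generates the new state from the previous state and the new data
Example: keep each user's mood as state and update it with their tweets
// Pseudocode signature: (new tweets, last mood) => new mood
def updateMood(newTweets, lastMood) => newMood
val moods = tweetsByUser.updateStateByKey(updateMood _)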
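The state-update pattern can be sketched in plain Python as a fold over batches: for every key, the update function receives that key's new values in the batch plus its previous state, and returns the new state. The mood score below (+1 per ":)", -1 per ":(") is an invented example, and `update_state_by_key` is a hypothetical helper that imitates the idea rather than Spark's implementation.

```python
def update_state_by_key(batches, update, state=None):
    """Fold keyed batches into per-key state; return the state after each batch."""
    state = dict(state or {})
    snapshots = []
    for batch in batches:
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # Every known key is updated; keys without new data get an empty list.
        for key in set(state) | set(grouped):
            state[key] = update(grouped.get(key, []), state.get(key))
        snapshots.append(dict(state))
    return snapshots

def update_mood(new_tweets, last_mood):
    """Hypothetical mood score: add 1 per smiley and subtract 1 per frown."""
    score = last_mood or 0
    for text in new_tweets:
        score += text.count(":)") - text.count(":(")
    return score

# Made-up (user, tweet) batches.
tweets_by_user = [
    [("ann", "great day :)"), ("bob", "ugh :(")],
    [("ann", "still good :) :)")],
]
moods = update_state_by_key(tweets_by_user, update_mood)
print(moods)
```

Bob posts nothing in the second batch, so his state carries over unchanged.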

2. Mixing RDD and DStream operations
- Example: join incoming tweets against a spam file on HDFS to filter out bad tweets
tweets.transform(tweetsRDD => { tweetsRDD.join(spamFile).filter(…) })
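In plain Python the same pattern looks like this: a batch-level function is applied to every batch of the stream, and that function is free to consult a static dataset. The dictionary standing in for the spam file, the tweets, and the helper names are all hypothetical. The lookup-then-filter below is a simplification of a real join.

```python
def transform(dstream, batch_func):
    """Apply an arbitrary batch-level function to every batch of the stream."""
    return [batch_func(batch) for batch in dstream]

def drop_spam(batch, spam_users):
    """Look each user up in the static spam table and keep the non-spam rows."""
    return [(user, text) for user, text in batch if not spam_users.get(user, False)]

# Hypothetical static dataset, standing in for the spam file loaded from HDFS.
spam_file = {"bot42": True}
tweets = [
    [("ann", "hello"), ("bot42", "buy now")],
    [("bot42", "click"), ("bob", "hi")],
]
clean = transform(tweets, lambda batch: drop_spam(batch, spam_file))
print(clean)
```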

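3. Combining RDDs and DStreams

Combine live data streams with historical data
- Generate a model of the historical data with Spark, etc.
- Use the model to process the live data stream
Combine streaming with MLlib and GraphX algorithms
- Learn offline, predict online
- Learn and predict online
Query streaming data with SQL
- select * from table_from_streaming_data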

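4. Advantages of a unified stack

Explore data interactively to identify problems
Use the same code in Spark to process large logs
Use similar code in Spark Streaming for real-time processing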

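5. Fault tolerance

Batches of input data are replicated in memory for fault tolerance
Data lost to a worker failure can be recomputed from the replicated input data
All transformations are fault-tolerant, with exactly-once semantics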
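Recomputation works because the transformations are deterministic: rerunning the same function on the replicated input reproduces the lost result exactly. The toy sketch below simulates that with a dictionary of replicated input batches. The data and names are hypothetical and no real failure handling is involved.

```python
def word_lengths(batch):
    """A deterministic transformation: the same input always gives the same output."""
    return [len(word) for word in batch]

# Hypothetical replicated input, keyed by batch ID.
replicated_input = {0: ["spark", "streaming"], 1: ["rdd"]}
results = {bid: word_lengths(batch) for bid, batch in replicated_input.items()}

# Simulate a worker failure that loses the result of batch 1 ...
lost = results.pop(1)
# ... then recover it by recomputing from the replicated input.
results[1] = word_lengths(replicated_input[1])
print(results[1] == lost)
```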

6. Input sources
Kafka, Flume, Akka Actors, raw TCP sockets, HDFS, etc.

a_victory
