[Translation] Developing Apache Spark Applications with Mobius, the Spark Extension Package for .NET

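In a recent big data project I needed to do some data mining. I have always developed in C# and do not know Java, so I tried doing Spark development in C#. I found Microsoft's Mobius project online, but there is very little Chinese-language material on it (or perhaps I just have not found it yet), so I spent a few hours translating this article. I will translate more related material later if it proves useful.
This is my first translation, so please bear with any mistakes or rough spots. For the sake of rigor, the English original is kept.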
============================================================
Developing Apache Spark Applications in .NET using Mobius
Extending Spark to the .NET developer community


by Kaarthik Sivashanmugam (a member of Microsoft's Mobius project). Posted in Company Blog, August 3, 2016.


To address the gap between Spark and .NET, Microsoft created Mobius, an open source project, with guidance from Databricks (the company commercializing Spark). By adding a C# language API to Spark, Mobius enables .NET Framework developers to build Apache Spark applications. This guest blog provides an overview of this C# API.

Apache Spark has transformed the big data processing and analytics space over the last few years. It provides high-level APIs in Scala, Java, Python and R, and it has dramatically reduced the cost and complexity of building a wide variety of big data workloads. The results of Spark Survey 2015 indicate that ease of programming is one of the most important aspects of Spark. It is apparent that having APIs in multiple languages appealed to a variety of developer personas and contributed to the rapid adoption of Spark.

However, Spark had been out of reach for the .NET developer community. The results of Spark Survey 2015 also indicated that there was a huge spike in Spark usage on Windows, and there is a high likelihood that a good portion of the developers using Spark on Windows are .NET professionals. To address the gap between Spark and .NET, Microsoft created Mobius as an open source project with the goal of adding a C# language API to Spark, enabling the use of any .NET Framework language in building Apache Spark applications. With Mobius, organizations deeply invested in .NET can reuse their existing .NET libraries in their Spark applications.


Spark Applications in .NET


The C# language binding to Spark is similar to the Python and R bindings. In fact, Mobius follows the same design pattern and leverages the existing implementation of language binding components in Spark where applicable for consistency and reuse. The following picture shows the dependency between the .NET application and the C# API in Mobius, which internally depends on Spark’s public API in Scala and Java and extends PythonRDD from PySpark to implement CSharpRDD.



 
As shown above, the driver programs are written entirely in a .NET programming language like C# or F# using the C# API in Mobius. Mobius applications can be used with Spark deployed on premises or in the cloud. Mobius is supported on Windows and Linux. In Linux, Mobius uses Mono, an open source implementation of the .NET framework.


Developing & Submitting Mobius Applications

Mobius driver applications can be developed in an IDE (like Visual Studio) that supports .NET development. Mobius API and the worker implementation (used to execute user defined functionality in C# code in Spark worker nodes) are released to NuGet. Once these Mobius binaries and any other .NET library dependencies are added to the Mobius driver project in the IDE, the driver application code can be developed, debugged and tested like any other .NET program within the IDE.


Mobius driver applications in .NET are compiled into an executable (.exe file), which is copied along with its dependencies to the client machine from which the Spark job is to be submitted. A supported Mobius release is also needed on that client machine, where the Mobius job submission script (sparkclr-submit.cmd or sparkclr-submit.sh) is used to submit the Mobius-based application to a Spark cluster. The Mobius job submission script accepts the same parameters as the spark-submit script, plus an additional parameter specifying the Mobius driver executable name and its path.

More information on running a Mobius application is available on GitHub.

The Mobius API has the same method names and signatures with similar data types as the Scala API for Spark. As a result, the driver programs implemented using Mobius look similar to those implemented in Scala or Java. Here is a code example for implementing Spark’s “Word Count” example in C# using Mobius API.


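// Read the input file as an RDD of lines.
var textFile = sparkContext.TextFile(@"hdfs://...");
var counts = textFile
             .FlatMap(x => x.Split(' '))                              // split each line into words
             .Map(word => new KeyValuePair<string, int>(word, 1))     // pair each word with a count of 1
             .ReduceByKey((x, y) => x + y)                            // sum the counts per word
             .Map(wordCount => $"{wordCount.Key},{wordCount.Value}"); // format as "word,count"
// Write the formatted results back to storage.
counts.SaveAsTextFile(@"hdfs://...");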
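
To see what the word-count pipeline above computes without a Spark cluster, the same steps can be mimicked on an in-memory sequence with LINQ. This is only a local sketch, not Mobius code: the class name LocalWordCount, the method CountWords and the sample input are made up for illustration, and GroupBy plus Aggregate stands in for ReduceByKey.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Local, in-memory analogue of the Mobius word count (hypothetical names, no Spark involved).
public static class LocalWordCount
{
    public static List<string> CountWords(IEnumerable<string> lines)
    {
        return lines
            // FlatMap: one line becomes many words.
            .SelectMany(line => line.Split(' '))
            // Map: pair each word with an initial count of 1.
            .Select(word => new KeyValuePair<string, int>(word, 1))
            // ReduceByKey: group the pairs by word, then fold each group's counts with (x, y) => x + y.
            .GroupBy(pair => pair.Key, pair => pair.Value)
            .Select(group => new KeyValuePair<string, int>(
                group.Key, group.Aggregate((x, y) => x + y)))
            // Final Map: format each result as "word,count".
            .Select(wordCount => $"{wordCount.Key},{wordCount.Value}")
            .ToList();
    }

    public static void Main()
    {
        var lines = new[] { "to be or", "not to be" };
        foreach (var row in CountWords(lines))
        {
            Console.WriteLine(row);
        }
    }
}
```

Unlike this sketch, Spark distributes the reduction across worker nodes and makes no promise about the order of the output rows.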


The code snippet below is in F# and shows how to query the data in JSON format and use the DataFrame API to look for rows with State = ‘California’ and also register those rows as a temp table and use Spark SQL to query for all rows with name = ‘Bill’.


// Load JSON data into a DataFrame.
let peopleDataFrame = sqlContext.Read().Json("hdfs://...")
// Keep the name and nested state columns, then filter to California.
let filteredDf = peopleDataFrame.Select("name", "address.state")
                 .Where("state = 'California'")
filteredDf.Show()
// Register the filtered rows as a temp table so Spark SQL can query them.
filteredDf.RegisterTempTable "filteredDfAsTempTable"
let countAsDf = sqlContext.Sql "SELECT * FROM filteredDfAsTempTable where name='Bill'"
let countOfRows = countAsDf.Count()
printf "Count of rows with name='Bill' and State='California' = %d" countOfRows
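
The two-stage filtering in the F# query above (first by state through the DataFrame API, then by name through Spark SQL) can be made concrete with a small in-memory C# sketch. It uses plain LINQ rather than Mobius, and the Person and Address types, the class name LocalPeopleQuery and the sample rows are all hypothetical stand-ins for the JSON data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory analogue of the DataFrame query (hypothetical types and data, no Spark involved).
public static class LocalPeopleQuery
{
    public sealed class Address
    {
        public string State { get; set; }
    }

    public sealed class Person
    {
        public string Name { get; set; }
        public Address Address { get; set; }
    }

    public static long CountBillsInCalifornia(IEnumerable<Person> people)
    {
        // Stage 1, like Select("name", "address.state").Where("state = 'California'").
        var filtered = people
            .Select(p => new { Name = p.Name, State = p.Address.State })
            .Where(row => row.State == "California")
            .ToList();

        // Stage 2, like querying the temp table for name='Bill' and counting the rows.
        return filtered.LongCount(row => row.Name == "Bill");
    }

    public static void Main()
    {
        var people = new List<Person>
        {
            new Person { Name = "Bill", Address = new Address { State = "California" } },
            new Person { Name = "Bill", Address = new Address { State = "Washington" } },
            new Person { Name = "Ann",  Address = new Address { State = "California" } },
        };
        Console.WriteLine(CountBillsInCalifornia(people));
    }
}
```

In the Mobius version the intermediate result stays distributed and is exposed to SQL through the registered temp table; here it is just a list held in memory.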

More examples for RDD, DataFrame and DStream API are available here. These examples also cover HDFS, Cassandra, Event Hubs, Kafka, Hive and JDBC sources in Mobius applications.


More Resources


You can peruse our GitHub repository, and we welcome your contributions. Additional information on Mobius is available in the slides and video from the talk on Mobius presented at Spark Summit 2016. Finally, Mobius powers several .NET-based Spark workloads at Microsoft. For example, the Spark Summit 2016 talk (slides, video) covers the lessons learned using Spark in a Bing-scale workload.