Spark - Introduction

Introduction

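Apache Spark is a unified distributed computing engine that spans different workloads and platforms. Through its various paradigms (Spark Streaming, Spark ML, Spark SQL, and Spark GraphX), it can connect to different platforms and process different kinds of data workloads.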

Spark is a fast in-memory data processing engine.
It consists of a core and a set of libraries.
The core is the distributed execution engine and provides Java, Scala, and Python APIs.
The libraries provide streaming, queries, machine learning, and graph processing.

  • Uses in-memory processing as much as possible
  • General-purpose engine for both batch and real-time workloads
  • Compatible with YARN and Mesos
  • Integrates well with HBase, Cassandra, MongoDB, HDFS, Amazon S3, and other file systems and data sources

Features:

  • Transparently processes data on multiple nodes via a simple API
  • Resiliently handles failures
  • Primarily uses memory, spilling to disk when necessary
  • The same Spark code can run standalone, in Hadoop YARN, Mesos, and the cloud

Apache Spark does not provide a storage layer; it relies on external storage such as HDFS or Amazon S3.
Hadoop provides distributed storage and the MapReduce distributed computing framework. Spark, on the other hand, is a data processing framework that operates on distributed data storage provided by other technologies.
If you need to do analytics on streaming data, or your processing requires multistage logic, you will probably want to go with Spark.

Spark has three layers:

  • Cluster manager: can be standalone, YARN, or Mesos. In local mode, you don't need a cluster manager to process data
  • Core: provides all the underlying APIs for task scheduling and interacting with storage
  • Libraries: components built on the core, such as Spark SQL for interactive queries, Spark Streaming for real-time analytics, Spark ML for machine learning, and Spark GraphX for graph processing


Spark core

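Spark core is the underlying general-purpose execution engine. It contains the functionality needed to run jobs, as well as the functionality the other components depend on.
It provides in-memory computation and can reference datasets held in external storage through its central abstraction, the Resilient Distributed Dataset (RDD).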
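The resilience of an RDD comes from remembering how each partition was derived, so a lost partition can be recomputed rather than restored from a replica. The following is a minimal pure-Python sketch of that idea, not Spark's API: the `ToyRDD` class and its `lose_partition` method are hypothetical names invented for illustration.

```python
class ToyRDD:
    """Toy stand-in for an RDD: results are rebuilt from lineage on demand."""

    def __init__(self, source_partitions, transforms=()):
        self.source_partitions = source_partitions  # stands in for external storage
        self.transforms = tuple(transforms)          # lineage: ordered (kind, fn) steps
        self.cache = {}                              # partition index -> computed rows

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage.
        return ToyRDD(self.source_partitions, self.transforms + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.source_partitions, self.transforms + (("filter", pred),))

    def compute_partition(self, i):
        # Replay the lineage for one partition if it is not held in memory.
        if i not in self.cache:
            data = list(self.source_partitions[i])
            for kind, fn in self.transforms:
                if kind == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            self.cache[i] = data
        return self.cache[i]

    def collect(self):
        out = []
        for i in range(len(self.source_partitions)):
            out.extend(self.compute_partition(i))
        return out

    def lose_partition(self, i):
        # Simulate a node failure that drops one in-memory partition.
        self.cache.pop(i, None)


source = [[1, 2, 3], [4, 5, 6]]
rdd = ToyRDD(source).map(lambda x: x * 10).filter(lambda x: x > 20)
first = rdd.collect()
rdd.lose_partition(1)
recovered = rdd.collect()  # partition 1 is recomputed from the source
print(first, recovered)
```

Only the lost partition is rebuilt; partition 0 is still served from the in-memory cache.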

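It provides the logic for accessing various storage systems, such as HDFS, Amazon S3, HBase, Cassandra, and relational databases.
It also provides the basics:

  • Supporting functions for networking, security, and scheduling
  • Data shuffling (redistributing data across nodes), used to build a highly scalable, fault-tolerant platform for distributed computing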
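A shuffle moves records between nodes so that all values for a given key end up in the same place. Here is a small pure-Python sketch of hash partitioning under that assumption; it is a toy model rather than Spark's implementation, and `partition_for`, `shuffle`, and `reduce_counts` are hypothetical helper names.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    # Deterministic hash so the same key always maps to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair to the bucket chosen by hashing its key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partition in map_outputs:
        for key, value in partition:
            buckets[partition_for(key, num_reducers)][key].append(value)
    return buckets


def reduce_counts(bucket):
    # Each reducer sees all values for its keys and can aggregate locally.
    return {key: sum(values) for key, values in bucket.items()}


# Two "map-side" partitions, each holding (word, 1) pairs.
map_outputs = [
    [("spark", 1), ("core", 1), ("spark", 1)],
    [("sql", 1), ("spark", 1), ("core", 1)],
]
buckets = shuffle(map_outputs, num_reducers=2)
reduced = [reduce_counts(b) for b in buckets]

totals = {}
for part in reduced:
    totals.update(part)  # keys never overlap between buckets
print(totals)
```

Because each key lands in exactly one bucket, the per-bucket results can be merged without any further aggregation.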

DataFrames and Datasets are built on top of RDDs.

Spark SQL

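Spark SQL is a component on top of Spark core that introduced a data abstraction originally called SchemaRDD (renamed DataFrame in Spark 1.3).
It supports structured and semi-structured data.
Using the subset of SQL supported by Spark and HiveQL, it can operate on large amounts of distributed data.
Through DataFrames and Datasets, it simplifies the processing of structured data.
It supports reading from and writing to various data sources (such as files, Hive, HDFS, S3, and relational databases).
It provides a query optimization framework, Catalyst, to improve speed (typically faster than equivalent RDD code).
It includes a Thrift server, so external systems can query data over JDBC.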

Spark streaming

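It can perform streaming analytics from various sources (HDFS, Kafka, Flume, Twitter, ZeroMQ, Kinesis).
It processes data in chunks, as micro-batches.
It runs on top of RDDs.
It can recover automatically from various failures.
It can be combined with the other components in a single program.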
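The micro-batch model can be sketched in a few lines of plain Python: incoming events are grouped into fixed-width time windows, and each window is then processed like a small batch job. This is only an illustration of the idea, not Spark Streaming's API; `micro_batches` and the sample events are assumptions made for the example.

```python
from collections import Counter


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width windows."""
    batches = {}
    for timestamp, value in events:
        batch_id = int(timestamp // batch_interval)
        batches.setdefault(batch_id, []).append(value)
    # Simplification: intervals with no events are skipped in this toy.
    return [batches[b] for b in sorted(batches)]


# Hypothetical event stream: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (1.9, "a"), (2.5, "c")]

batches = micro_batches(events, batch_interval=1.0)

# Each micro-batch is processed independently, like a small batch job...
per_batch_counts = [Counter(batch) for batch in batches]

# ...and state can be carried across batches if a running result is needed.
running_total = Counter()
for counts in per_batch_counts:
    running_total.update(counts)

print(batches)
print(dict(running_total))
```

The batch interval trades latency for throughput: a shorter interval gives fresher results but more, smaller jobs.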

Spark GraphX

GraphX provides functions for building graphs, represented as graph RDDs.
User-defined graphs can be modeled using the Pregel abstraction API.
GraphX also contains implementations of the most important algorithms of graph theory, such as PageRank, connected components, shortest paths, SVD++, and others.
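The Pregel model is vertex-centric: computation proceeds in supersteps, and in each one a vertex reads its incoming messages, updates its own state, and sends messages to its neighbors. Below is a minimal pure-Python sketch of connected components in that style. It does not use GraphX, and the function name `pregel_connected_components` is hypothetical.

```python
def pregel_connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    labels = {v: v for v in vertices}
    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [labels[n] for n in neighbors[v]] for v in vertices}

    # Keep running supersteps until no messages are in flight.
    while any(inbox.values()):
        next_inbox = {v: [] for v in vertices}
        for v, messages in inbox.items():
            if not messages:
                continue  # vertex is inactive this superstep
            smallest = min(messages)
            if smallest < labels[v]:
                # State changed, so notify neighbors in the next superstep.
                labels[v] = smallest
                for n in neighbors[v]:
                    next_inbox[n].append(smallest)
        inbox = next_inbox
    return labels


vertices = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (4, 5)]
labels = pregel_connected_components(vertices, edges)
print(labels)
```

Vertices that end with the same label belong to the same component; here that gives the groups {1, 2, 3} and {4, 5}.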

Spark ML

MLlib is a distributed machine learning framework.
It provides various algorithms such as logistic regression, Naive Bayes classification, Support Vector Machines (SVMs), decision trees, random forests, linear regression, Alternating Least Squares (ALS), and k-means clustering.
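To give a feel for one of the listed algorithms, here is a single-machine sketch of k-means (Lloyd's iteration) on one-dimensional data. It is written in plain Python and does not use MLlib; the function `kmeans_1d`, the sample points, and the initial centers are assumptions chosen for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points and recomputing centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)

        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]  # keep an empty cluster's center
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = kmeans_1d(points, centers=[1.0, 12.0])
print(centers)
```

A distributed version would compute the per-cluster sums and counts on each partition and combine them, but the two alternating steps stay the same.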
