Major Changes in Hadoop 3.x (Relative to Hadoop 2.x)

 

       Someone asked today what the major changes in Hadoop 3.x are, so I looked them up on the official site (http://hadoop.apache.org/docs/r3.0.0/index.html) and summarize a rough translation below:

  • 1. Minimum required Java version raised to Java 8
  • 2. HDFS supports erasure coding

        Compared with replication, erasure coding is a more space-efficient way of durably storing data. A standard encoding such as Reed-Solomon (10,4) has a 1.4x space overhead, whereas standard HDFS replication has a 3x overhead. Because erasure coding's extra cost lies mainly in reconstruction and remote reads, it has traditionally been used for cold, infrequently accessed data. Users should weigh the network and CPU overhead of erasure coding when deploying this feature. For more on HDFS erasure coding, see http://hadoop.apache.org/docs/r3.0.0-beta1/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html.
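The overhead numbers above are simple arithmetic. A minimal sketch in plain Python (my own illustration, no Hadoop dependencies) comparing Reed-Solomon (10,4) with 3-way replication:

```python
def storage_overhead_replication(replicas):
    """Total bytes stored per byte of user data under replication."""
    return float(replicas)

def storage_overhead_ec(data_units, parity_units):
    """Total bytes stored per byte of user data under erasure coding:
    (data + parity) blocks are written for every `data_units` blocks of data."""
    return (data_units + parity_units) / data_units

# Standard HDFS replication keeps 3 full copies:
print(storage_overhead_replication(3))   # 3.0
# Reed-Solomon (10,4): 10 data blocks + 4 parity blocks:
print(storage_overhead_ec(10, 4))        # 1.4
```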
  • 3. YARN Timeline Service updated to v.2

        This release introduces an early preview of YARN Timeline Service v.2, which addresses two major challenges: improving the scalability and reliability of the Timeline Service, and enhancing usability by introducing flows and aggregation.

YARN Timeline Service v.2 alpha 2 is provided so that users and developers can test it and give feedback, with the goal of making it a ready replacement for Timeline Service v.1.x. It should only be used in test environments. For more on YARN Timeline Service v.2, see http://hadoop.apache.org/docs/r3.0.0-beta1/hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html

  • 4. Shell scripts rewritten; for example, all scripts now build on hadoop-env.sh as the base script

        Hadoop's shell scripts have been rewritten to fix many long-standing bugs and to add some new features. Most behavior remains compatible, but some changes may break existing installations. Incompatible changes are documented in HADOOP-9902. For more, see the Unix Shell Guide documentation; even power users should read it, since it describes much of the new functionality, particularly around extensibility.

  • 5. Shaded client jars: the Maven shade plugin is used to bundle Hadoop's dependencies into the hadoop-client-api and hadoop-client-runtime artifacts

        In Hadoop 2.x, the hadoop-client Maven artifact pulls all of Hadoop's transitive dependencies onto a Hadoop application's classpath, which can cause conflicts between the classes the application depends on and the classes Hadoop depends on. This problem is addressed in HADOOP-11804.

  • 6. Support for opportunistic containers and distributed scheduling; for example, a container can still be dispatched even when no resources are available at the moment of scheduling

        This release introduces a new Opportunistic container type, which can exploit resources on a node that have been allocated but are not actually in use. The original container type is now called Guaranteed. Opportunistic containers have lower priority than Guaranteed containers.
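The admission idea can be sketched as a two-tier check: Guaranteed containers draw only on unallocated capacity, while Opportunistic containers may also borrow allocated-but-idle capacity and are preempted first when Guaranteed work needs room. A toy model (my own illustration, not YARN code):

```python
def admit(kind, requested, unallocated, allocated_idle):
    """Decide whether a container request fits on a node.
    GUARANTEED containers use only unallocated capacity;
    OPPORTUNISTIC ones may also borrow allocated-but-idle capacity."""
    if kind == "GUARANTEED":
        return requested <= unallocated
    if kind == "OPPORTUNISTIC":
        return requested <= unallocated + allocated_idle
    raise ValueError("unknown container type: " + kind)

# Node with 0 units unallocated, but 4 allocated units sitting idle:
print(admit("GUARANTEED", 2, 0, 4))     # False: must wait for free capacity
print(admit("OPPORTUNISTIC", 2, 0, 4))  # True: runs on idle capacity
```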

  • 7. MapReduce task-level native optimization

        MapReduce adds a native implementation of the map output collector. For shuffle-intensive jobs, this can yield a performance improvement of 30% or more. See MAPREDUCE-2841 for details.

  • 8. Support for more than 2 NameNodes

        The initial HDFS NameNode high-availability implementation provided a single active NameNode and a single standby NameNode; by replicating edits to three JournalNodes, that architecture can tolerate the failure of any one node in the system. However, some deployments require a higher degree of fault tolerance. This new feature enables that by allowing users to run multiple standby NameNodes. For example, by configuring three NameNodes and five JournalNodes, the cluster can tolerate the failure of two nodes rather than just one. The HDFS high-availability documentation has been updated with instructions on how to configure more than two NameNodes.
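The fault-tolerance arithmetic follows from quorum majorities: a JournalNode quorum of N tolerates floor((N - 1) / 2) failures, and at least one NameNode must survive. A small sketch (my own illustration, not Hadoop code):

```python
def tolerated_failures(namenodes, journalnodes):
    """Node failures an HDFS HA setup can survive: the JournalNode
    quorum tolerates floor((JN - 1) / 2) losses, and at least one
    NameNode must remain alive."""
    jn_tolerance = (journalnodes - 1) // 2
    nn_tolerance = namenodes - 1
    return min(jn_tolerance, nn_tolerance)

print(tolerated_failures(2, 3))  # classic HA pair: 1
print(tolerated_failures(3, 5))  # the example above: 2
```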

  • 9. Default ports of several services have changed

        Previously, the default ports of several Hadoop services fell within the Linux ephemeral port range (32768-61000). This meant a service could fail to start because of a port conflict with another application. These conflict-prone ports have now been moved out of the ephemeral range; the changes affect the NameNode, Secondary NameNode, DataNode, and KMS. The official documentation has been updated accordingly; see HDFS-9427 and HADOOP-12811 for details.

       
Daemon              Protocol        Old port  New port
NameNode            RPC             8020      9820
NameNode            HTTP web UI     50070     9870
NameNode            HTTPS web UI    50470     9871
Secondary NameNode  HTTP web UI     50090     9868
Secondary NameNode  HTTPS web UI    50091     9869
DataNode            IPC             50020     9867
DataNode            Data transfer   50010     9866
DataNode            HTTP web UI     50075     9864
DataNode            HTTPS web UI    50475     9865
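For scripted upgrades, the old-to-new mapping above can be captured in a small lookup table (a sketch, keyed by daemon and protocol names of my own choosing):

```python
# Hadoop 2.x -> 3.x default port changes (per HDFS-9427 / HADOOP-12811)
PORT_CHANGES = {
    ("namenode", "rpc"):             (8020, 9820),
    ("namenode", "http"):            (50070, 9870),
    ("namenode", "https"):           (50470, 9871),
    ("secondary-namenode", "http"):  (50090, 9868),
    ("secondary-namenode", "https"): (50091, 9869),
    ("datanode", "ipc"):             (50020, 9867),
    ("datanode", "data-transfer"):   (50010, 9866),
    ("datanode", "http"):            (50075, 9864),
    ("datanode", "https"):           (50475, 9865),
}

def new_port(daemon, protocol):
    """Return the Hadoop 3.x default port for a daemon/protocol pair."""
    return PORT_CHANGES[(daemon, protocol)][1]

print(new_port("namenode", "http"))  # 9870
```

Note that every new default sits below 32768, i.e. outside the Linux ephemeral range.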

 

  • 10. Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors
  • 11. New intra-DataNode balancer

        A single DataNode manages multiple disks, and during normal writes the disks fill up evenly. However, adding or replacing disks can lead to significant skew within that DataNode. The existing HDFS balancer cannot handle this case; it is handled by the new intra-DataNode balancing functionality, invoked via the hdfs diskbalancer CLI. See the HDFS Commands Guide for more.
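The skew the disk balancer targets can be pictured as the spread of per-volume utilization within one node: the classic balancer evens data out across DataNodes, while hdfs diskbalancer moves blocks between volumes inside one. A toy sketch of that within-node spread (my own illustration, not the actual diskbalancer metric):

```python
def volume_density_spread(used, capacity):
    """Max minus min volume utilization on one DataNode; moving blocks
    between volumes shrinks this spread toward zero."""
    densities = [u / c for u, c in zip(used, capacity)]
    return max(densities) - min(densities)

# Three old disks near full plus one freshly added empty disk (in TB):
print(volume_density_spread([0.9, 0.85, 0.88, 0.0], [1.0, 1.0, 1.0, 1.0]))  # 0.9
```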

  • 12. Reworked daemon and task heap management

        A series of changes have been made to heap management for Hadoop daemons and MapReduce tasks.

HADOOP-10950: introduces new ways of configuring daemon heap sizes. Notably, the heap size can now be auto-tuned based on the host's memory size, and HADOOP_HEAPSIZE has been deprecated.
MAPREDUCE-5785: simplifies the configuration of map and reduce task heap sizes; the desired heap size no longer needs to be specified both in the task configuration and as a Java option. Existing configurations that already specify both are unaffected by this change.
  • 13. Support for DynamoDB as a metadata store for the S3A filesystem client

        HADOOP-13345 introduces a new optional feature for the S3A client of the Amazon S3 storage system: the ability to use a DynamoDB table as a fast, consistent store of file and directory metadata.

  • 14. HDFS supports router-based federation

        HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. It is similar to the existing ViewFs and HDFS Federation functionality, except that the mount table is maintained server-side by the routing layer rather than on the client. This simplifies access to a federated cluster for existing HDFS clients. See HDFS-10467 for details.
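Conceptually, the router's mount table maps path prefixes to subcluster namespaces, much like ViewFs, but resolved on the server side. A simplified longest-prefix resolution sketch (illustrative only, not the Router's actual code; namespace names are made up):

```python
def resolve(mount_table, path):
    """Pick the subcluster whose mount point is the longest prefix of path."""
    best = None
    for mount, subcluster in mount_table.items():
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            if best is None or len(mount) > len(best[0]):
                best = (mount, subcluster)
    if best is None:
        raise KeyError("no mount point for " + path)
    return best[1]

table = {"/": "ns0", "/data": "ns1", "/data/warehouse": "ns2"}
print(resolve(table, "/data/warehouse/sales"))  # ns2
print(resolve(table, "/user/alice"))            # ns0
```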

  • 15. REST API for modifying Capacity Scheduler configuration

        The OrgQueue extension to the Capacity Scheduler provides a programmatic way to change queue configurations via a REST API. This enables administrators in a queue's administer_queue ACL to automate queue configuration management. See YARN-5734 for details.

  • 16. YARN resources beyond traditional CPU and memory: user-defined resource types such as GPUs

        The YARN resource model has been generalized to support user-defined countable resource types beyond CPU and memory. For example, a cluster administrator can define resources such as GPUs, software licenses, or locally-attached storage, and YARN tasks can then be scheduled based on the availability of those resources. See YARN-3926 for details.
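The generalized model treats every resource as a named countable quantity, so a scheduling check reduces to a vector comparison. A minimal sketch (illustrative only, not YARN's API; the resource names are assumptions):

```python
def can_schedule(available, request):
    """A container request fits iff every requested resource type,
    whether CPU, memory, or user-defined (e.g. 'gpu'), is available
    in the requested quantity."""
    return all(available.get(res, 0) >= amount for res, amount in request.items())

node = {"vcores": 8, "memory-mb": 16384, "gpu": 2}
print(can_schedule(node, {"vcores": 2, "memory-mb": 4096, "gpu": 1}))  # True
print(can_schedule(node, {"vcores": 2, "memory-mb": 4096, "gpu": 4}))  # False
```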

Apache Hadoop 3.0.0

Apache Hadoop 3.0.0 incorporates a number of significant enhancements over the previous major release line (hadoop-2.x).

This release is generally available (GA), meaning that it represents a point of API stability and quality that we consider production-ready.

Overview

Users are encouraged to read the full set of release notes. This page provides an overview of the major changes.

Minimum required Java version increased from Java 7 to Java 8

All Hadoop JARs are now compiled targeting a runtime version of Java 8. Users still using Java 7 or below must upgrade to Java 8.

Support for erasure coding in HDFS

Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication.

Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature.

More details are available in the HDFS Erasure Coding documentation.

YARN Timeline Service v.2

We are introducing an early preview (alpha 2) of a major revision of YARN Timeline Service: v.2. YARN Timeline Service v.2 addresses two major challenges: improving scalability and reliability of Timeline Service, and enhancing usability by introducing flows and aggregation.

YARN Timeline Service v.2 alpha 2 is provided so that users and developers can test it and provide feedback and suggestions for making it a ready replacement for Timeline Service v.1.x. It should be used only in a test capacity.

More details are available in the YARN Timeline Service v.2 documentation.

Shell script rewrite

The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. While an eye has been kept towards compatibility, some changes may break existing installations.

Incompatible changes are documented in the release notes, with related discussion on HADOOP-9902.

More details are available in the Unix Shell Guide documentation. Power users will also be pleased by the Unix Shell API documentation, which describes much of the new functionality, particularly related to extensibility.

Shaded client jars

The hadoop-client Maven artifact available in 2.x releases pulls Hadoop’s transitive dependencies onto a Hadoop application’s classpath. This can be problematic if the versions of these transitive dependencies conflict with the versions used by the application.

HADOOP-11804 adds new hadoop-client-api and hadoop-client-runtime artifacts that shade Hadoop’s dependencies into a single jar. This avoids leaking Hadoop’s dependencies onto the application’s classpath.

Support for Opportunistic Containers and Distributed Scheduling.

A notion of ExecutionType has been introduced, whereby Applications can now request for containers with an execution type of Opportunistic. Containers of this type can be dispatched for execution at an NM even if there are no resources available at the moment of scheduling. In such a case, these containers will be queued at the NM, waiting for resources to be available for it to start. Opportunistic containers are of lower priority than the default Guaranteed containers and are therefore preempted, if needed, to make room for Guaranteed containers. This should improve cluster utilization.

Opportunistic containers are by default allocated by the central RM, but support has also been added to allow opportunistic containers to be allocated by a distributed scheduler which is implemented as an AMRMProtocol interceptor.

Please see documentation for more details.

MapReduce task-level native optimization

MapReduce has added support for a native implementation of the map output collector. For shuffle-intensive jobs, this can lead to a performance improvement of 30% or more.

See the release notes for MAPREDUCE-2841 for more detail.

Support for more than 2 NameNodes.

The initial implementation of HDFS NameNode high-availability provided for a single active NameNode and a single Standby NameNode. By replicating edits to a quorum of three JournalNodes, this architecture is able to tolerate the failure of any one node in the system.

However, some deployments require higher degrees of fault-tolerance. This is enabled by this new feature, which allows users to run multiple standby NameNodes. For instance, by configuring three NameNodes and five JournalNodes, the cluster is able to tolerate the failure of two nodes rather than just one.

The HDFS high-availability documentation has been updated with instructions on how to configure more than two NameNodes.

Default ports of multiple services have been changed.

Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000). This meant that at startup, services would sometimes fail to bind to the port due to a conflict with another application.

These conflicting ports have been moved out of the ephemeral range, affecting the NameNode, Secondary NameNode, DataNode, and KMS. Our documentation has been updated appropriately, but see the release notes for HDFS-9427 and HADOOP-12811 for a list of port changes.

Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors

Hadoop now supports integration with Microsoft Azure Data Lake and Aliyun Object Storage System as alternative Hadoop-compatible filesystems.

Intra-datanode balancer

A single DataNode manages multiple disks. During normal write operation, disks will be filled up evenly. However, adding or replacing disks can lead to significant skew within a DataNode. This situation is not handled by the existing HDFS balancer, which concerns itself with inter-, not intra-, DN skew.

This situation is handled by the new intra-DataNode balancing functionality, which is invoked via the hdfs diskbalancer CLI. See the disk balancer section in the HDFS Commands Guide for more information.

Reworked daemon and task heap management

A series of changes have been made to heap management for Hadoop daemons as well as MapReduce tasks.

HADOOP-10950 introduces new methods for configuring daemon heap sizes. Notably, auto-tuning is now possible based on the memory size of the host, and the HADOOP_HEAPSIZE variable has been deprecated. See the full release notes of HADOOP-10950 for more detail.

MAPREDUCE-5785 simplifies the configuration of map and reduce task heap sizes, so the desired heap size no longer needs to be specified in both the task configuration and as a Java option. Existing configs that already specify both are not affected by this change. See the full release notes of MAPREDUCE-5785 for more details.

S3Guard: Consistency and Metadata Caching for the S3A filesystem client

HADOOP-13345 adds an optional feature to the S3A client of Amazon S3 storage: the ability to use a DynamoDB table as a fast and consistent store of file and directory metadata.

See S3Guard for more details.

HDFS Router-Based Federation

HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. This is similar to the existing ViewFs and HDFS Federation functionality, except the mount table is managed on the server-side by the routing layer rather than on the client. This simplifies access to a federated cluster for existing HDFS clients.

See HDFS-10467 and the HDFS Router-based Federation documentation for more details.

API-based configuration of Capacity Scheduler queue configuration

The OrgQueue extension to the capacity scheduler provides a programmatic way to change configurations by providing a REST API that users can call to modify queue configurations. This enables automation of queue configuration management by administrators in the queue’s administer_queue ACL.

See YARN-5734 and the Capacity Scheduler documentation for more information.

 

YARN Resource Types

The YARN resource model has been generalized to support user-defined countable resource types beyond CPU and memory. For instance, the cluster administrator could define resources like GPUs, software licenses, or locally-attached storage. YARN tasks can then be scheduled based on the availability of these resources.

See YARN-3926 and the YARN resource model documentation for more information.
