What Is the Relationship Between Block Size and Split Size in Hadoop?

When learning the Hadoop MapReduce workflow, the very first step is the split. We know that data in HDFS is stored in blocks. So the question arises: what is the relationship between splits and blocks? I found a Stack Overflow post via Google that I think explains the relationship well, so I have translated it here; corrections are welcome. It follows:

Question

What is the relationship between split size and block size in Hadoop? Must the split size be n times the block size?

Concepts

In the HDFS architecture there is a concept of blocks. A typical HDFS block is 64 MB. When we import a large file into HDFS, it is chopped into 64 MB blocks (the default may differ between versions). If you store a 1 GB file in HDFS, then 1 GB / 64 MB = 1024 MB / 64 MB = 16 blocks will be distributed across the DataNodes.
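To see the block layout concretely, here is a minimal sketch (the class name and the command-line path are illustrative) that uses the HDFS Java API to list a file's blocks; for a 1 GB file with 64 MB blocks it should print 16 entries:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);  // e.g. a 1 GB file already in HDFS
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block: a 1 GB file with 64 MB blocks gives 16 entries.
        for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
    }
}
```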

Purpose

The data-splitting strategy is based on file offsets. The purpose of splitting a file is to enable parallel processing of the data and to support failover and recovery.

Difference

A split is a logical split, typically used in MapReduce programs or other data-processing frameworks. The split size is user-definable, chosen according to how much data you are processing.

Once defined, the split size controls the number of mappers in a MapReduce job. If no split size is defined in the MapReduce program, the default HDFS block size is used as the input split size.
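Here is a minimal sketch of pinning the split size from a MapReduce driver, assuming the new org.apache.hadoop.mapreduce API; the class name is hypothetical and the job setup is abbreviated:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Cap each split at 25 MB: a 100 MB input then yields 4 splits, hence 4 mappers.
        FileInputFormat.setMaxInputSplitSize(job, 25L * 1024 * 1024);
        // If neither min nor max is set, the split size falls back to the block size.
        // ... set mapper/reducer classes and the output path, then:
        // job.waitForCompletion(true);
    }
}
```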

Example

Suppose you have a 100 MB file and the block size is 64 MB; the file occupies 2 blocks. If you do not specify an input split size, your MapReduce program processes it as 2 input splits, and 2 mappers are assigned to the job.

But if you set the split size to 100 MB, the MapReduce program treats the data as a single split, and only 1 mapper is needed.

And if you set the split size to 25 MB, the MapReduce program divides the data into 4 splits and assigns 4 mappers to the job.
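The arithmetic behind these three cases can be summed up in a small sketch (it ignores the 10% slop tolerance FileInputFormat applies to the last split, which does not change the numbers here):

```java
public class SplitCount {
    // Splits for a single file, ignoring FileInputFormat's 10% slop on the last split.
    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;  // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        System.out.println(numSplits(100 * mb, 64 * mb));   // 2 splits -> 2 mappers
        System.out.println(numSplits(100 * mb, 100 * mb));  // 1 split  -> 1 mapper
        System.out.println(numSplits(100 * mb, 25 * mb));   // 4 splits -> 4 mappers
    }
}
```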

Summary

  1. A block is a physical division of the data, while a split is a logical one.
  2. If not otherwise specified, the split size equals the HDFS block size.
  3. Users can define their own split size in a MapReduce program.
  4. One split can span multiple blocks, and one block can be covered by multiple splits.
  5. A split never contains blocks from two files; it does not cross file boundaries.
  6. There are as many mappers as there are splits.

Original:

Q: What is the relationship between split size and block size in Hadoop? As I read in this, the split size must be n times the block size (n is an integer and n > 0); is this correct? Is there any required relationship between split size and block size?

A: In the HDFS architecture there is a concept of blocks. A typical block size used by HDFS is 64 MB. When we place a large file into HDFS, it is chopped up into 64 MB chunks (based on the default block configuration). Suppose you have a 1 GB file and you want to place it in HDFS; then there will be 1 GB / 64 MB = 16 splits/blocks, and these blocks will be distributed across the DataNodes based on your cluster configuration.

Data splitting happens based on file offsets. The goal of splitting a file and storing it in different blocks is parallel processing and failover of data.

Difference between block size and split size.

A split is a logical split of your data, basically used during data processing with a Map/Reduce program or other processing techniques. Split size is user defined, and you can choose your split size based on how much data you are processing.

Splits are basically used to control the number of mappers in a Map/Reduce program. If you have not defined any input split size in your Map/Reduce program, then the default HDFS block size will be used as the input split size.

Example:

Suppose you have a 100 MB file and the HDFS default block size is 64 MB; then it will be chopped into 2 splits and occupy 2 blocks. Now if you have a Map/Reduce program to process this data but have not specified an input split size, then the number of blocks (2) will be taken as the number of input splits, and 2 mappers will be assigned to the job.

But suppose you specify a split size (say 100 MB) in your Map/Reduce program; then both blocks (2 blocks) will be treated as a single split, and 1 mapper will be assigned to the job.

Suppose you specify a split size of, say, 25 MB in your Map/Reduce program; then there will be 4 input splits, and 4 mappers will be assigned to the job.

Conclusion:

  1. Split is a logical division of the input data, while block is a physical division of the data.
  2. The HDFS default block size is the default split size if no input split is specified.
  3. Split is user defined, and the user can control the split size in their Map/Reduce program.
  4. One split can map to multiple blocks, and there can be multiple splits of one block.
  5. The number of map tasks (mappers) is equal to the number of splits.

ref: http://stackoverflow.com/questions/30549261/split-size-vs-block-size-in-hadoop

Supplement

How to set these parameters

Split size property:

mapred.max.split.size (the pre-Hadoop-2 name; the newer equivalent is mapreduce.input.fileinputformat.split.maxsize)

Block size property:

dfs.block.size (the older name; newer releases use dfs.blocksize)

Note: do not change the block size lightly, as it affects the entire HDFS cluster.
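For completeness, here is a minimal sketch of setting the same knobs through a Configuration object, using the Hadoop 2.x property names; the 128 MB values are illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class SplitConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Per-job split bound; safe to tune for a single job:
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024);
        // Block size applies to files written with this conf; the cluster-wide
        // default affects all of HDFS, so change it with care.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        System.out.println(conf.get("mapreduce.input.fileinputformat.split.maxsize"));
    }
}
```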

How the split size is computed:

Given mapreduce.input.fileinputformat.split.maxsize and mapreduce.input.fileinputformat.split.minsize, FileInputFormat.computeSplitSize() picks the split size as follows:

split size = Math.max(minSize, Math.min(maxSize, blockSize))

By default, minSize = 1 and maxSize = 2^63 − 1 (Long.MAX_VALUE), so the default split size equals the block size.
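A small sketch of that rule and its defaults (the method name mirrors the Hadoop source, where it is a protected member of FileInputFormat):

```java
public class SplitSizeRule {
    // Mirrors FileInputFormat.computeSplitSize(), which is protected in Hadoop.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // 64 MB
        long minSize = 1L;                   // default minimum
        long maxSize = Long.MAX_VALUE;       // default maximum, 2^63 - 1
        // With the defaults the split size equals the block size:
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));  // 67108864
    }
}
```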

Performance

In general, performance is best when the split size matches the block size.

FileInputFormat generates splits in such a way that each split is all or part of a single file. This is one reason Hadoop handles a few large files much more efficiently than many small ones.

Reference: <http://www.imx3.com/hadoop_blocksize_vs_splitsize.html>
