Apache Hadoop over OpenStack Swift

Original article: http://bigdatacraft.com/archives/349 By Camuel Gilyadov, on March 1st, 2012

This is a post by Constantine Peresypkin and David Gruzman.

Lately we have been working on integrating Hadoop with OpenStack Swift. Hadoop doesn't need an introduction, and neither does OpenStack. Swift is an object-storage system and the technology behind RackSpace Cloud Files (and quite a few others, like Korea Telecom object storage, Internap, etc.).

Before we go into the details of the Hadoop-Swift integration, let's get some relevant background:
  1. Hadoop already has integration with Amazon S3 and is widely used to crunch S3-stored data: http://wiki.apache.org/hadoop/AmazonS3
  2. The NameNode is a known SPOF in Hadoop. If it can be avoided, so much the better.
  3. The current S3 integration stages all data as temporary files on local disk on its way to S3. That is because S3 needs to know the content length in advance; it is one of the required headers.
  4. The current S3 integration also suffers from a 5 GB maximum file size limitation, which is slightly annoying.
  5. Hadoop requires seek support, which means that HTTP Range support is required if it is run over an object store; S3 supports it. (A sketch of how seek maps onto an object store follows this list.)
  6. Append support is optional for Hadoop, but it is required for HBase. S3 doesn't have any append support, so the native integration cannot run HBase over S3.
  7. While OpenStack Swift is compatible with S3, RackSpace CloudFiles is not, because RackSpace CloudFiles disables the S3-compatibility layer in Swift. This prevents existing Swift users from integrating with Hadoop.
  8. The only information available on the Internet about Hadoop-Swift integration is that it should work using Apache Whirr. But to the best of our knowledge, that is relevant only to rolling out a Block FileSystem on top of Swift, not a Native FileSystem. In other words, we haven't found any solution for processing data that is already stored in RackSpace CloudFiles without costly re-importing.
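
To make the seek requirement in point 5 concrete, here is a minimal sketch of how a Hadoop split reader uses the FileSystem API. The bucket name, file path, offset, and credentials are placeholders, not values from the original post; the point is that an object-store connector must translate FSDataInputStream.seek() into an HTTP Range request.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekOverObjectStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Credential properties of Hadoop's classic S3 native connector (s3n);
        // the values here are dummies.
        conf.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY_PLACEHOLDER");
        conf.set("fs.s3n.awsSecretAccessKey", "SECRET_KEY_PLACEHOLDER");

        // "my-bucket" and the part file below are hypothetical.
        FileSystem fs = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
        FSDataInputStream in = fs.open(new Path("s3n://my-bucket/logs/part-00000"));
        try {
            // A map task that owns the split starting at 64 MB seeks there
            // before reading; over an object store this seek has to become
            // an HTTP request carrying "Range: bytes=67108864-".
            in.seek(64L * 1024 * 1024);
            byte[] buf = new byte[4096];
            int n = in.read(buf);
            System.out.println("read " + n + " bytes at the split offset");
        } finally {
            in.close();
        }
    }
}
```

This is the contract the CloudFiles SDK could not satisfy out of the box (see point 2 of the next list): without Range support, every seek would force re-reading the object from the beginning.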
So, armed with the above information, let's examine what we've got here:
  1. In general, we instrumented Hadoop to run over Swift natively, without resorting to the S3-compatibility layer. This means it works with CloudFiles, which lacks that layer.
  2. The CloudFiles client SDK doesn't support HTTP Range functionality. We hacked it to allow using HTTP Range; this is a must for Hadoop to work.
  3. We removed the need for the NameNode, in the same way it is removed in the Amazon S3 integration.
  4. As opposed to the S3 implementation, we avoided staging files on local disk on their way to and from CloudFiles/Swift. In other words, data is streamed directly between compute-node RAM and CloudFiles/Swift (see the sketch after this list).
  5. The data is still processed remotely, though; extensive data shipping takes place between compute nodes and CloudFiles/Swift. As frequent readers of this blog know, we are working on technology that will allow running code snippets directly in Swift. Look here for more details: http://www.zerovm.com. As a next step, we plan to perform predicate-pushdown optimization to process most of the data completely locally, inside a ZeroVM-enabled object-storage system.
  6. Support for native Swift large objects is also planned (something that is absent in Amazon S3).
  7. We are also working on append support for Swift (this could easily be done through Swift's large-object support, which uses versioning), so even HBase will work on top of Swift; this is not the case with S3 today.
  8. As is the case with Hadoop over S3, storing BigData in native format on Swift provides options for multi-site replication and CDN.
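
As an illustration of point 4, the following is a minimal sketch of streaming an upload straight into Swift from memory; the endpoint URL, account, container, object name, and auth token are hypothetical, and this plain-HTTP sketch stands in for whatever client code the integration actually uses. Because Swift accepts chunked transfer encoding, no Content-Length header is needed up front (unlike S3), so nothing has to be staged on local disk:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SwiftStreamingUpload {
    // Streams src into a Swift object without buffering it on local disk.
    public static void upload(InputStream src, String authToken) throws Exception {
        // Hypothetical Swift endpoint, account, container, and object name.
        URL url = new URL("https://swift.example.com/v1/AUTH_account/container/object");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("X-Auth-Token", authToken);
        conn.setDoOutput(true);
        // Chunked transfer encoding: the total length never has to be known,
        // which is what makes direct RAM-to-Swift streaming possible.
        conn.setChunkedStreamingMode(64 * 1024);
        OutputStream out = conn.getOutputStream();
        try {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = src.read(buf)) != -1) {
                out.write(buf, 0, n); // bytes go from RAM straight to Swift
            }
        } finally {
            out.close();
        }
        if (conn.getResponseCode() != 201) { // Swift answers 201 Created on success
            throw new RuntimeException("Swift PUT failed: HTTP " + conn.getResponseCode());
        }
    }
}
```

Append (point 7) could plausibly be layered on the same machinery: each append becomes a new segment object, and a large-object manifest stitches the segments together, which is what would let HBase run on top of Swift while it cannot on S3.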
