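Introduction to Spring Hadoop

On February 29, 2012, a date that comes around only once every four years, Spring Hadoop 1.0.0.M1 was released.

The official Spring blog introduces Spring Hadoop and gives configuration examples covering MapReduce, HBase, Hive, Pig and more; the details follow below.

Original link: http://blog.springsource.org/2012/02/29/introducing-spring-hadoop/

The original post follows: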

Introducing Spring Hadoop

Posted on February 29th, 2012 by Costin Leau in Containers, Data Access, Enterprise Integration, IOC Container, Java, Open Source, Spring Data.

I am happy to announce that the first milestone release (1.0.0.M1) of the Spring Hadoop project is available, and to talk about some of the work we have been doing over the last few months. Part of the Spring Data umbrella, Spring Hadoop provides support for developing applications based on Hadoop technologies by leveraging the capabilities of the Spring ecosystem. Whether one is writing stand-alone, vanilla MapReduce applications, interacting with data from multiple data stores across the enterprise, coordinating a complex workflow of HDFS, Pig, or Hive jobs, or anything in between, Spring Hadoop stays true to the Spring philosophy, offering a simplified programming model and addressing the "accidental complexity" caused by the infrastructure. Spring Hadoop provides a powerful tool in the developer's arsenal for dealing with big data volumes.

MapReduce Jobs

The "Hello World" of Hadoop is the word count example, a simple use case that exposes the base Hadoop capabilities. When using Spring Hadoop, the word count example looks as follows:

 
<!-- configure Hadoop FS/job tracker using defaults -->
<hdp:configuration/>

<!-- define the job -->
<hdp:job id="word-count"
  input-path="/input/" output-path="/output/"
  mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
  reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>

<!-- execute the job -->
<bean id="runner" class="org.springframework.data.hadoop.mapreduce.JobRunner"
  p:jobs-ref="word-count"/>

Notice how the creation and submission of the job configuration are handled by the IoC container. Whether the Hadoop configuration needs to be tweaked or the reducer needs extra parameters, all the configuration options remain available. This allows you to start small and have the configuration grow alongside the app. The configuration can be as simple or as advanced as the developer needs it to be, taking advantage of Spring container functionality such as property placeholders and environment support:

<hdp:configuration resources="classpath:/my-cluster-site.xml">
    fs.default.name=${hd.fs}
    hadoop.tmp.dir=file://${java.io.tmpdir}
    electric=sea
</hdp:configuration>

<context:property-placeholder location="classpath:hadoop.properties"/>

<!-- populate Hadoop distributed cache -->
<hdp:cache create-symlink="true">
  <hdp:classpath value="/cp/some-library.jar#library.jar"/>
  <hdp:cache value="/cache/some-archive.tgz#main-archive"/>
  <hdp:cache value="/cache/some-resource.res"/>
</hdp:cache>

(The word count example is part of the Spring Hadoop distribution; feel free to download it and experiment.)
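
For readers unfamiliar with word count, the following is a minimal plain-Java sketch of the computation that the mapper and reducer named in the job definition earlier perform: the map step splits each line into whitespace-separated tokens, and the reduce step sums the occurrences of each token. It is an illustration only. The class name `WordCountSketch` and the method `count` are hypothetical, no Hadoop or Spring Hadoop API is involved, and everything runs in a single JVM rather than on a cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-JVM illustration of the word count logic; not Hadoop code.
final class WordCountSketch {

    // "Map" step: split each line into whitespace-separated tokens.
    // "Reduce" step: sum the occurrences of each token.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                totals.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Prints each distinct word with its total, sorted by word.
        System.out.println(count(Arrays.asList("to be or not to be", "to do")));
    }
}
```

On a real cluster the two steps run on different nodes, with Hadoop shuffling the intermediate (word, 1) pairs between them; the sketch collapses that into one loop.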

Spring Hadoop does not require you to rewrite your MapReduce job in Java; non-Java streaming jobs can be used seamlessly. They are just objects (or, as Spring calls them, beans) that are created, configured, wired and managed like any other by the framework in a consistent, coherent manner. The developer can mix and match according to her preference and requirements without having to worry about integration issues.

 
<hdp:streaming id="streaming-env"
  input-path="/input/" output-path="/output/"
  mapper="${path.cat}" reducer="${path.wc}">
  <hdp:cmd-env>
    EXAMPLE_DIR=/home/example/dictionaries/
  </hdp:cmd-env>
</hdp:streaming>

Existing Hadoop Tool implementations are also supported; in fact, rather than specifying custom Hadoop properties through the command line, one can simply inject them:

<!-- the tool automatically is injected with 'hadoop-configuration' -->
<hdp:tool-runner id="scalding" tool-class="com.twitter.scalding.Tool">
   <hdp:arg value="tutorial/Tutorial1"/>
   <hdp:arg value="--local"/>
</hdp:tool-runner>

The configuration above executes Tutorial1 of Twitter's Scalding, a Scala DSL on top of the Cascading library (see below). Note that there is no dedicated support code in either Spring Hadoop or Scalding; only the standard Hadoop APIs are used.

Working with HBase/Hive/Pig

Speaking of DSLs, it is quite common to use higher-level abstractions when interacting with Hadoop – popular choices include HBase, Hive or Pig. Spring Hadoop provides integration for all of these, allowing easy configuration and consumption of these data sources inside a Spring app:

<!-- HBase configuration with nested properties -->
<hdp:hbase-configuration stop-proxy="false" delete-connection="true">
    foo=bar
</hdp:hbase-configuration>

<!-- create a Pig instance using custom properties
    and execute a script (using given arguments) at startup -->
<hdp:pig properties-location="pig-dev.properties">
   <script location="org/company/pig/script.pig">
     <arguments>electric=tears</arguments>
   </script>
</hdp:pig>

Through Spring Hadoop, one gets not only a powerful IoC container but also access to Spring's portable service abstractions. For example, the popular JdbcTemplate can be used on top of Hive's JDBC client:

<!-- basic Hive driver bean -->
<bean id="hive-driver" class="org.apache.hadoop.hive.jdbc.HiveDriver"/>

<!-- wrapping a basic datasource around the driver -->
<bean id="hive-ds"
    class="org.springframework.jdbc.datasource.SimpleDriverDataSource"
    c:driver-ref="hive-driver" c:url="${hive.url}"/>

<!-- standard JdbcTemplate declaration -->
<bean id="template" class="org.springframework.jdbc.core.JdbcTemplate"
    c:data-source-ref="hive-ds"/>

Cascading

Spring also supports a Java-based, type-safe configuration model. One can use it as an alternative or complement to declarative XML configurations, such as with Cascading:

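import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import cascading.operation.text.DateParser;
import cascading.pipe.Each;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

@Configuration
public class CascadingConfig {

    @Value("${cascade.sec}")
    private String sec;

    // parses the "time" field into a "ts" timestamp field
    @Bean
    public Pipe tsPipe() {
        DateParser dateParser = new DateParser(new Fields("ts"),
                 "dd/MMM/yyyy:HH:mm:ss Z");
        return new Each("arrival rate", new Fields("time"), dateParser);
    }

    // groups the parsed tuples by their "ts" field
    @Bean
    public Pipe tsCountPipe() {
        Pipe tsCountPipe = new Pipe("tsCount", tsPipe());
        return new GroupBy(tsCountPipe, new Fields("ts"));
    }
}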

<!-- code configuration class -->
<bean class="org.springframework.data.hadoop.cascading.CascadingConfig"/>

<bean id="cascade"
    class="org.springframework.data.hadoop.cascading.HadoopFlowFactoryBean"
    p:configuration-ref="hadoop-configuration" p:tail-ref="tsCountPipe"/>

The example above mixes both programmatic and declarative configurations: the former to create the individual Cascading pipes and the latter to wire them together into a flow.
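
To illustrate what the two pipes compute, here is a rough sketch in plain Java with no Cascading dependency: each "time" value is parsed with the same date pattern into a numeric timestamp, and the values are then grouped by that timestamp. The class `TsGroupingSketch` and its methods are hypothetical names, and the sample input format (an access-log style timestamp) is an assumption based on the pattern string.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory analogue of the parse-then-group pipeline; not Cascading code.
final class TsGroupingSketch {

    // Same pattern string as the DateParser in the configuration class.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Analogue of the parsing step: text in, epoch milliseconds out.
    static long parseTimestamp(String time) {
        return OffsetDateTime.parse(time, FORMAT).toInstant().toEpochMilli();
    }

    // Analogue of the grouping step: collect the inputs that share a timestamp.
    static Map<Long, List<String>> groupByTimestamp(List<String> times) {
        Map<Long, List<String>> groups = new TreeMap<>();
        for (String time : times) {
            groups.computeIfAbsent(parseTimestamp(time), ts -> new ArrayList<>()).add(time);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> times = Arrays.asList(
                "29/Feb/2012:10:15:30 +0000",
                "29/Feb/2012:11:15:30 +0100",  // same instant as the first entry
                "29/Feb/2012:10:15:31 +0000");
        groupByTimestamp(times).forEach((ts, group) ->
                System.out.println(ts + " -> " + group.size() + " event(s)"));
    }
}
```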

Using Spring's portable service abstractions

One can also use Spring's excellent task/scheduling support to submit jobs at certain times:

<task:scheduler id="myScheduler" pool-size="10"/>

<task:scheduled-tasks scheduler="myScheduler">
 <!-- run once a day, at midnight (Spring cron: second minute hour day month weekday) -->
 <task:scheduled ref="word-count-job" method="submit" cron="0 0 0 * * *"/>
</task:scheduled-tasks>

The configuration above uses a simple JDK Executor instance, which is excellent for POC development. In production, one can easily replace it (a one-liner) with a more comprehensive solution such as a dedicated scheduler or a WorkManager implementation: another example of Spring's powerful service abstractions.
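
To make the JDK executor remark concrete, here is a minimal sketch, using only JDK classes, of running a task once a day at midnight. The class `MidnightSchedulerSketch`, the method `delayUntilNextMidnight` and the placeholder task are hypothetical; a real setup would invoke the job's submit method instead, and this sketch ignores the time-zone and daylight-saving handling that a proper scheduler provides.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical JDK-only illustration of "run once a day, at midnight".
final class MidnightSchedulerSketch {

    // Time remaining from 'now' until the start of the next day (00:00).
    static Duration delayUntilNextMidnight(LocalDateTime now) {
        LocalDateTime nextMidnight = now.toLocalDate().plusDays(1).atStartOfDay();
        return Duration.between(now, nextMidnight);
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        Runnable submitJob = () -> System.out.println("submitting job (placeholder)");

        long initialDelayMs = delayUntilNextMidnight(LocalDateTime.now()).toMillis();
        scheduler.scheduleAtFixedRate(
                submitJob, initialDelayMs, TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
        System.out.println("first run in " + initialDelayMs + " ms");

        // A real application would keep the scheduler alive; the demo shuts it
        // down right away (cancelling the periodic task) so that it terminates.
        scheduler.shutdown();
    }
}
```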

HDFS/Scripting

A common task when interacting with HDFS is preparing the file system, such as cleaning the output directory to avoid overwriting data or moving all input files under the same naming scheme or folder. Spring Hadoop addresses the issue by fully embracing Hadoop's fs commands, such as FS Shell and DistCp, and exposing them as proper Java APIs. Combine that with JVM scripting (whether Groovy, JRuby or Rhino/JavaScript) to form a powerful combination:

<hdp:script language="groovy">
  inputPath = "/user/gutenberg/input/word/"
  outputPath = "/user/gutenberg/output/word/"

  if (fsh.test(inputPath)) {
    fsh.rmr(inputPath)
  }

  if (fsh.test(outputPath)) {
    fsh.rmr(outputPath)
  }

  fs.copyFromLocalFile("data/input.txt", inputPath)
</hdp:script>

Summary

This post just scratches the surface of the features available in Spring Hadoop; I have not mentioned the Spring Batch integration, which provides tasklets for various Hadoop interactions, or the use of Spring Integration for event triggering. More about that in a future entry.
Let us know what you think and what you need, and give us feedback: download the code, fork the source, report issues, post on the forum or send us a tweet.
