On the official Spring blog, SpringSource has introduced Spring Hadoop and provided configuration examples covering MapReduce, HBase, Hive, Pig and more. The details follow below.
MapReduce Jobs
The Hello world for Hadoop is the
word count example – a simple use-case that exposes the base Hadoop capabilities. When using Spring Hadoop, the word count example looks as follows:
input-path = "/input/"
output-path = "/ouput/" |
mapper = "org.apache.hadoop.examples.WordCount.TokenizerMapper" |
reducer = "org.apache.hadoop.examples.WordCount.IntSumReducer" />
|
< bean
id = "runner"
class = "org.springframework.data.hadoop.mapreduce.JobRunner" |
p:jobs-ref = "word-count" /> |
Notice how the creation and submission of the job configuration is handled by the IoC container. Whether the Hadoop configuration needs to be tweaked or the reducer needs extra parameters, all the configuration options are still available for you to configure.
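Bootstrapping the container is all it takes to get the job going. Here is a minimal sketch, assuming the configuration above lives in a file named hadoop-context.xml:

// Minimal bootstrap sketch: starting the IoC container wires up the
// Hadoop configuration, the job and the JobRunner declared above.
// The context file name "hadoop-context.xml" is an assumption.
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class WordCountBootstrap {
    public static void main(String[] args) {
        ClassPathXmlApplicationContext ctx =
                new ClassPathXmlApplicationContext("hadoop-context.xml");
        try {
            // the JobRunner bean takes care of submitting "word-count"
        } finally {
            ctx.close();
        }
    }
}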
This allows you to start small and have the configuration grow alongside the app. The configuration can be as simple or as advanced as the developer wants or needs it to be, taking advantage of Spring container functionality such as property placeholders and environment support:
<hdp:configuration resources="classpath:/my-cluster-site.xml">
    hadoop.tmp.dir=file://${java.io.tmpdir}
</hdp:configuration>

<context:property-placeholder location="classpath:hadoop.properties" />
<hdp:cache create-symlink="true">
    <hdp:classpath value="/cp/some-library.jar#library.jar" />
    <hdp:cache value="/cache/some-archive.tgz#main-archive" />
    <hdp:cache value="/cache/some-resource.res" />
</hdp:cache>
(the word count
example is part of the Spring Hadoop distribution – feel free to download it and experiment).
Spring Hadoop does not require you to rewrite your MapReduce jobs in Java; you can use non-Java streaming jobs seamlessly: they are just objects (or, as Spring calls them, beans) that are created, configured, wired and managed like any other by the framework in a consistent, coherent manner. Developers can mix and match according to their preferences and requirements without having to worry about integration issues.
<hdp:streaming id="streaming-env"
    input-path="/input/" output-path="/output/"
    mapper="${path.cat}" reducer="${path.wc}">
    <hdp:cmd-env>
        EXAMPLE_DIR=/home/example/dictionaries/
    </hdp:cmd-env>
</hdp:streaming>
Existing Hadoop Tool implementations are also supported; in fact, rather than specifying custom Hadoop properties through the command line, one can simply inject them:
<hdp:tool-runner id="scalding" tool-class="com.twitter.scalding.Tool">
    <hdp:arg value="tutorial/Tutorial1" />
    <hdp:arg value="--local" />
</hdp:tool-runner>
The configuration above executes Tutorial1 of Twitter's Scalding (a Scala DSL on top of the Cascading library, covered below). Note there is no dedicated support code in either Spring Hadoop or Scalding – just the standard Hadoop APIs are being used.
Working with HBase/Hive/Pig
Speaking of DSLs, it is quite common to use higher-level abstractions when interacting with Hadoop – popular choices include
HBase, Hive or
Pig. Spring Hadoop provides integration for all of these, allowing easy configuration and consumption of these
data sources inside a Spring app:
<hdp:hbase-configuration stop-proxy="false" delete-connection="true" />

<hdp:pig properties-location="pig-dev.properties" />
<script location="org/company/pig/script.pig">
    <arguments>electric=tears</arguments>
</script>
Through Spring Hadoop, one not only gets a powerful IoC container but also access to Spring's portable service abstractions. Take the popular JdbcTemplate: one can use it on top of Hive's JDBC client:
<bean id="hive-driver" class="org.apache.hadoop.hive.jdbc.HiveDriver" />

<bean id="hive-ds" class="org.springframework.jdbc.datasource.SimpleDriverDataSource"
    c:driver-ref="hive-driver" c:url="${hive.url}" />

<bean id="template" class="org.springframework.jdbc.core.JdbcTemplate"
    c:data-source-ref="hive-ds" />
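Once wired, the template can be queried against Hive just like any relational database. A minimal usage sketch (the context file name and the HiveQL statement are assumptions of this example):

// Usage sketch for the beans above: the JdbcTemplate runs HiveQL through
// Hive's JDBC driver just as it would run SQL against a regular database.
// "hive-context.xml" and the queried table are assumptions.
import java.util.List;
import java.util.Map;
import org.springframework.context.support.ClassPathXmlApplicationContext;
import org.springframework.jdbc.core.JdbcTemplate;

public class HiveQuery {
    public static void main(String[] args) {
        ClassPathXmlApplicationContext ctx =
                new ClassPathXmlApplicationContext("hive-context.xml");
        try {
            JdbcTemplate template = ctx.getBean("template", JdbcTemplate.class);
            // the table name below is hypothetical
            List<Map<String, Object>> rows =
                    template.queryForList("SELECT * FROM word_counts LIMIT 10");
            for (Map<String, Object> row : rows) {
                System.out.println(row);
            }
        } finally {
            ctx.close();
        }
    }
}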
Cascading
Spring also supports a Java-based, type-safe configuration model. One can use it as an alternative or complement to declarative XML configurations – such as with Cascading:
import cascading.operation.text.DateParser;
import cascading.pipe.Each;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class CascadingConfig {

    @Value("${cascade.sec}")
    private String sec;

    @Bean
    public Pipe tsPipe() {
        // parse the incoming "time" field into a "ts" timestamp field
        DateParser dateParser = new DateParser(new Fields("ts"),
                "dd/MMM/yyyy:HH:mm:ss Z");
        return new Each("arrival rate", new Fields("time"), dateParser);
    }

    @Bean
    public Pipe tsCountPipe() {
        // group the parsed timestamps so they can be counted per value
        Pipe tsCountPipe = new Pipe("tsCount", tsPipe());
        tsCountPipe = new GroupBy(tsCountPipe, new Fields("ts"));
        return tsCountPipe;
    }
}
<bean class="org.springframework.data.hadoop.cascading.CascadingConfig" />

<bean class="org.springframework.data.hadoop.cascading.HadoopFlowFactoryBean"
    p:configuration-ref="hadoop-configuration" p:tail-ref="tsCountPipe" />
The example above mixes programmatic and declarative configuration: the former to create the individual Cascading pipes and the latter to wire them together into a flow.
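The resulting Flow bean can then be retrieved from the container and executed. A minimal sketch (the context file name and the by-type lookup are assumptions; a runnable flow also needs source and sink taps, which the snippet above omits):

// Execution sketch: fetch the Flow produced by HadoopFlowFactoryBean
// and run it to completion. "cascading-context.xml" is an assumption.
import cascading.flow.Flow;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class FlowRunner {
    public static void main(String[] args) {
        ClassPathXmlApplicationContext ctx =
                new ClassPathXmlApplicationContext("cascading-context.xml");
        try {
            Flow flow = ctx.getBean(Flow.class);
            flow.complete(); // block until the flow finishes
        } finally {
            ctx.close();
        }
    }
}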
Using Spring's portable service abstractions
Or use Spring's excellent
task/scheduling support to submit jobs at certain times:
<task:scheduler id="myScheduler" pool-size="10" />

<task:scheduled-tasks scheduler="myScheduler">
    <task:scheduled ref="word-count-job" method="submit" cron="0 0 * * * *" />
</task:scheduled-tasks>
The configuration above uses a simple JDK Executor instance – excellent for POC development. In production, one can easily replace it (a one-liner) with a more comprehensive solution such as a dedicated scheduler or a WorkManager implementation – another example of Spring's powerful service abstractions.
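The same trigger can also be expressed in Java through Spring's @Scheduled annotation. A minimal, annotation-driven sketch (it assumes <task:annotation-driven/> is enabled and that the job bean is injectable by type):

// Annotation-driven equivalent of the XML above (a sketch, not part of
// the original configuration): the Hadoop job bean is injected and
// submitted at the top of every hour.
import org.apache.hadoop.mapreduce.Job;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class WordCountScheduling {

    @Autowired
    private Job wordCountJob; // the job declared in the XML configuration

    // note: a plain Hadoop Job instance can only be submitted once;
    // repeated runs would need a prototype-scoped job bean
    @Scheduled(cron = "0 0 * * * *")
    public void submit() {
        try {
            wordCountJob.submit();
        } catch (Exception e) {
            throw new IllegalStateException("word-count submission failed", e);
        }
    }
}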
HDFS/Scripting
A common task when interacting with HDFS is preparing the file system, such as cleaning the output directory to avoid overwriting data, or moving all input files under the same naming scheme or folder. Spring Hadoop addresses the issue by fully embracing Hadoop's fs commands, such as FS Shell and DistCp, and exposing them as proper Java APIs. Mix that with JVM scripting (whether in Groovy, JRuby or Rhino/JavaScript) to form a powerful combination:
<hdp:script language="groovy">
    inputPath = "/user/gutenberg/input/word/"
    outputPath = "/user/gutenberg/output/word/"

    // clear any stale data so the job starts from a clean slate
    if (fsh.test(inputPath)) {
        fsh.rmr(inputPath)
    }
    if (fsh.test(outputPath)) {
        fsh.rmr(outputPath)
    }

    fs.copyFromLocalFile("data/input.txt", inputPath)
</hdp:script>
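The very same file-system operations are also available straight from Java. A minimal sketch using Spring Hadoop's FsShell API (the class name and paths here are illustrative assumptions, not part of the original post):

// Java counterpart of the Groovy script above, using Spring Hadoop's
// FsShell to prepare the input and output directories before a job run.
import org.apache.hadoop.conf.Configuration;
import org.springframework.data.hadoop.fs.FsShell;

public class InputPreparer {

    public static void prepare(Configuration conf) {
        FsShell fsh = new FsShell(conf);
        String inputPath = "/user/gutenberg/input/word/";
        String outputPath = "/user/gutenberg/output/word/";

        // mirror the script: remove stale directories, then stage the input
        if (fsh.test(inputPath)) {
            fsh.rmr(inputPath);
        }
        if (fsh.test(outputPath)) {
            fsh.rmr(outputPath);
        }
        fsh.copyFromLocal("data/input.txt", inputPath);
    }
}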
Summary
This post just touches the surface of some of the features available in Spring Hadoop; I have not mentioned the Spring Batch integration providing tasklets for
various Hadoop
interactions or the use of Spring Integration for event triggering – more about that in a future entry.
Let us know what you think, what you need and give us feedback:
download the code, fork the source,
report issues,
post on the forum or send us a tweet.