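Using Hadoop from Java: A Walkthrough

The previous article described how to use Docker to set up a simple, working Hadoop cluster. Once the cluster is up, it is ready to use.

Among Hadoop applications, the simplest example is probably something like wordcount, so this article walks through that workflow.

Project Setup

The project is a Maven project created in IDEA.

Here is the pom.xml file: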

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>mavenusage</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>commons-beanutils</groupId>
            <artifactId>commons-beanutils</artifactId>
            <version>1.9.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.8.5</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.8.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.8.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.8.5</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.myhadoop.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>

There is also an HDFSConnect.java, which is used to test whether code can connect to the Hadoop cluster that was set up. Its contents are as follows:

package com.myhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class HDFSConnect {
    public static void main(String[] args) throws IOException {
        System.out.println("hello world");
        Configuration conf = new Configuration();
        // 19000 is the host port that Docker maps to the cluster
        conf.set("fs.defaultFS", "hdfs://localhost:19000");
        // try-with-resources closes the FileSystem even if mkdirs throws
        try (FileSystem hdfs = FileSystem.get(conf)) {
            // Create a test directory and report whether the call succeeded
            boolean isSuccess = hdfs.mkdirs(new Path("/guoruibiaonew"));
            System.out.println(isSuccess ? "success" : "failure");
        }
    }
}

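Pay attention to the port here: Docker maps it to 19000 on the local machine. If the firewall settings allow a direct connection to the Docker container, the address should be 172.18.0.2:9000 instead.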
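
Because the address depends on where the client runs, it can help to avoid hard-coding it. The following is only a sketch in plain Java (no Hadoop classes): it falls back to the mapped address used in this article and otherwise checks that an override looks like hdfs://host:port. The class name HdfsEndpoint and the environment variable HDFS_DEFAULT_FS are hypothetical names chosen for illustration, not something Hadoop defines.

```java
import java.net.URI;

class HdfsEndpoint {
    // Default from this article: the port Docker maps to the local machine
    static final String DEFAULT_FS = "hdfs://localhost:19000";

    // Returns the override if it is a well-formed hdfs://host:port URI,
    // or the default when no override is given.
    static String resolve(String override) {
        if (override == null || override.trim().isEmpty()) {
            return DEFAULT_FS;
        }
        URI uri = URI.create(override.trim());
        if (!"hdfs".equals(uri.getScheme()) || uri.getHost() == null || uri.getPort() == -1) {
            throw new IllegalArgumentException("expected hdfs://host:port, got: " + override);
        }
        return uri.toString();
    }

    public static void main(String[] args) {
        // HDFS_DEFAULT_FS is a hypothetical variable name used only in this sketch
        String fs = resolve(System.getenv("HDFS_DEFAULT_FS"));
        System.out.println("fs.defaultFS = " + fs);
    }
}
```

The resolved string would then be passed to conf.set("fs.defaultFS", ...) in HDFSConnect, so the same class can target either localhost:19000 or 172.18.0.2:9000.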

To check whether the directory was created, pick any node and run the following command:

hdfs dfs -ls /

Writing the Code

The code simply follows the map + reduce pattern.
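
Before looking at the Hadoop classes, the pattern itself can be sketched in plain Java, with no cluster involved. This is only an in-memory illustration under simplifying assumptions: the class name WordCountSketch and the sample lines are made up, and a TreeMap stands in for the grouping and key sorting that the framework performs between the two phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // Map phase: turn one input line into (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Stand-in for the shuffle: group all values by key, keys kept sorted
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map, shuffle and reduce over all lines
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample input for the sketch
        List<String> lines = Arrays.asList("hello world", "hello hadoop");
        wordCount(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In the real job, the map and reduce methods live in separate Mapper and Reducer classes, and Hadoop takes care of the grouping step between them.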

MyMapper.java

package com.myhadoop;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

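public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused across calls to avoid allocating new objects for every word
    private final Text k = new Text();
    private final IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split on runs of whitespace so repeated spaces do not produce empty words
        String[] words = value.toString().trim().split("\\s+");
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // a blank input line yields a single empty token
            }
            k.set(word);
            context.write(k, v); // emit (word, 1)
        }
    }
}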

MyReducer.java

package com.myhadoop;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        v.set(sum);
        context.write(key,v);
    }
}

WordCount.java

package com.myhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class WordCount {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Build the Configuration instance
        Configuration configuration = new Configuration();
        // Other configuration settings go here

        // Get a Job instance
        Job job = Job.getInstance(configuration, "My WordCount Job");
        job.setJarByClass(WordCount.class);

        // Set the Mapper and Reducer classes
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Set the key/value types of the Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths come from the command line
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish; passing true prints progress.
        // The return value is true if the job succeeded: exit with 0 on success, 1 on failure.
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);

    }
}

Packaging the Project

Because this project is built with Maven, packaging is easy.
The steps are:

mvn clean
mvn package

The same steps can be run from within the IDE (figure: the Maven goals in the IDE).

Running It on Hadoop

Before running the job, write some content and put it on HDFS. For example, create a data.log with the following contents:

hello world
hello hadoop
hello tiger
this is a data file.

If the file was written on the local machine, it also has to be copied into the Docker container. The command is:

docker cp /Users/biao/IDEAProjects/mavenusage/data.log 7da3f0644f0f:/tmp

Then, on hadoop-node1, use the hdfs command to upload the file to HDFS.

# If the target directory does not exist on HDFS yet, create it first:
# hdfs dfs -mkdir /guoruibiao
hdfs dfs -put /tmp/data.log /guoruibiao

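Note that the jar built by Maven also has to be copied into the cluster containers, otherwise the job cannot be run. (Strictly speaking, only the node on which hadoop jar is executed needs the file, because the job jar is shipped to the cluster when the job is submitted.) The commands are: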

docker cp /Users/biao/IDEAProjects/mavenusage/target/mavenusage-1.0-SNAPSHOT.jar 7da3f0644f0f:/tmp
docker cp /Users/biao/IDEAProjects/mavenusage/target/mavenusage-1.0-SNAPSHOT.jar fe846930210d:/tmp

Adjust the paths to your own environment; the ones here are only for reference.

do it

Everything is ready. Now hand the jar over to Hadoop for execution.

[root@hadoop-node1 tmp]# hadoop jar /tmp/mavenusage-1.0-SNAPSHOT.jar com.myhadoop.WordCount /guoruibiao/data.log /wordcountoutput
20/04/11 06:52:08 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/04/11 06:52:09 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
20/04/11 06:52:10 INFO input.FileInputFormat: Total input files to process : 1
20/04/11 06:52:10 INFO mapreduce.JobSubmitter: number of splits:1
20/04/11 06:52:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1586586904355_0003
20/04/11 06:52:11 INFO impl.YarnClientImpl: Submitted application application_1586586904355_0003
20/04/11 06:52:11 INFO mapreduce.Job: The url to track the job: http://hadoop-node1:8088/proxy/application_1586586904355_0003/
20/04/11 06:52:11 INFO mapreduce.Job: Running job: job_1586586904355_0003
20/04/11 06:52:22 INFO mapreduce.Job: Job job_1586586904355_0003 running in uber mode : false
20/04/11 06:52:22 INFO mapreduce.Job:  map 0% reduce 0%
20/04/11 06:52:33 INFO mapreduce.Job:  map 100% reduce 0%
20/04/11 06:52:42 INFO mapreduce.Job:  map 100% reduce 100%
20/04/11 06:52:43 INFO mapreduce.Job: Job job_1586586904355_0003 completed successfully
20/04/11 06:52:43 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=130
		FILE: Number of bytes written=315797
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=167
		HDFS: Number of bytes written=64
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=8082
		Total time spent by all reduces in occupied slots (ms)=6327
		Total time spent by all map tasks (ms)=8082
		Total time spent by all reduce tasks (ms)=6327
		Total vcore-milliseconds taken by all map tasks=8082
		Total vcore-milliseconds taken by all reduce tasks=6327
		Total megabyte-milliseconds taken by all map tasks=8275968
		Total megabyte-milliseconds taken by all reduce tasks=6478848
	Map-Reduce Framework
		Map input records=4
		Map output records=11
		Map output bytes=102
		Map output materialized bytes=130
		Input split bytes=109
		Combine input records=0
		Combine output records=0
		Reduce input groups=9
		Reduce shuffle bytes=130
		Reduce input records=11
		Reduce output records=9
		Spilled Records=22
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=142
		CPU time spent (ms)=1690
		Physical memory (bytes) snapshot=412508160
		Virtual memory (bytes) snapshot=3884216320
		Total committed heap usage (bytes)=270008320
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=58
	File Output Format Counters
		Bytes Written=64
[root@hadoop-node1 tmp]# hdfs dfs -ls /wordcountoutput
Found 2 items
-rw-r--r--   2 root supergroup          0 2020-04-11 06:52 /wordcountoutput/_SUCCESS
-rw-r--r--   2 root supergroup         64 2020-04-11 06:52 /wordcountoutput/part-r-00000
[root@hadoop-node1 tmp]# hdfs dfs -cat /wordcountoutput/part-r-00000
a	1
data	1
file.	1
hadoop	1
hello	3
is	1
this	1
tiger	1
world	1
[root@hadoop-node1 tmp]#
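
Each line of part-r-00000 is a word and its count separated by a tab, which is what the default text output format writes when the job does not configure another one. If the result needs further processing, a small parser is enough. The sketch below is plain Java with a hypothetical class name, WordCountOutputParser; it assumes the lines have already been fetched, for example by redirecting the output of hdfs dfs -cat to a local file and reading it with Files.readAllLines.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class WordCountOutputParser {
    // Parses lines of the form "<word>\t<count>" and keeps their original order
    static Map<String, Integer> parse(Iterable<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // tolerate a trailing empty line
            }
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("missing tab separator: " + line);
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A few lines copied from the job output shown above
        Map<String, Integer> counts = parse(Arrays.asList("hadoop\t1", "hello\t3", "world\t1"));
        counts.forEach((word, count) -> System.out.println(word + " -> " + count));
    }
}
```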

enjoy it.
