Hadoop (1): Counting Words with MapReduce WordCount

Original post

2018-11-04 12:11

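I have recently started learning about big data and plan to write a series of notes documenting the process. This article briefly covers setting up a Hadoop cluster, the main steps of MapReduce programming, and how to submit a MapReduce program to YARN on a Linux server. It also comments briefly on a few common problems.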

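Hadoop cluster setup

Hadoop is a big data framework that uses a cluster of servers to process massive data sets in a distributed fashion according to the user's business logic. Its core components are the distributed file system (HDFS), the resource scheduling system (YARN), and the distributed computation framework (MapReduce). Setting up a Hadoop cluster means setting up an HDFS cluster and a YARN cluster on N Linux servers.
1. Hadoop is written in Java and therefore depends on a JDK. Configuring the JDK on Linux is not covered here; plenty of guides are available online and the installation is not difficult.
2. Download the Hadoop binary distribution (the hadoop-xxx.tar.gz file) from the official website, upload it to the Linux server and extract it. Make sure the version suits your Linux release; this article uses Hadoop 2.6.4 on CentOS 6.5.
3. On every server in the cluster, configure the IP address, host name and hosts file, set up passwordless SSH login, turn off the firewall, and add the Hadoop environment variables to /etc/profile.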

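Passwordless SSH login: suppose host A needs to log in to host B over SSH. On host A, first generate a key pair with ssh-keygen (just press Enter at each prompt), then append A's public key to B's authorized_keys file with ssh-copy-id B. Here A and B stand for the host names of the two Linux machines.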

4. Edit the configuration files under etc/hadoop/ in the Hadoop installation directory
Add the following to hadoop-env.sh
export JAVA_HOME=/home/hadoop/apps/jdk1.7.0_51
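core-site.xml: add the default file system setting (typically fs.defaultFS, pointing at the NameNode address)

hdfs-site.xml: add the HDFS settings (typically dfs.replication and the NameNode/DataNode storage directories)

mapred-site.xml: add the MapReduce settings (typically mapreduce.framework.name set to yarn)

yarn-site.xml: add the YARN settings (typically yarn.resourcemanager.hostname, and yarn.nodemanager.aux-services set to mapreduce_shuffle)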

slaves (lists the cluster nodes)

hdp-node-01
hdp-node-02
hdp-node-03

5. Start the cluster

# Format the NameNode (only before the first start)
bin/hadoop namenode -format
# Start HDFS
sbin/start-dfs.sh
# Start YARN
sbin/start-yarn.sh

Testing the cluster

Run hadoop fs -ls / to check that Hadoop is working. You can also run jps to check that the Java processes NameNode, ResourceManager, DataNode and NodeManager are present.
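
The jps check can also be scripted. The sketch below, in plain Java with no Hadoop dependency, takes the text printed by jps and reports which of the four expected daemons are missing. The class name JpsCheck, the method missingDaemons and the sample process IDs are made up for illustration; only the daemon names come from the text above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class JpsCheck {

    // Daemons named in the text above; on a multi-node cluster they are spread across machines.
    static final List<String> EXPECTED =
            Arrays.asList("NameNode", "DataNode", "ResourceManager", "NodeManager");

    /** Returns the expected daemons that do not appear in the given jps output. */
    static List<String> missingDaemons(String jpsOutput) {
        Set<String> running = new HashSet<>();
        for (String line : jpsOutput.split("\\R")) {
            // Each jps line has the form "<pid> <main class name>"
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2) {
                running.add(parts[1]);
            }
        }
        List<String> missing = new ArrayList<>();
        for (String daemon : EXPECTED) {
            if (!running.contains(daemon)) {
                missing.add(daemon);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical output with made-up process IDs; NodeManager is deliberately absent
        String sample = "2101 NameNode\n2230 DataNode\n2544 ResourceManager\n2899 Jps\n";
        System.out.println("Missing daemons: " + missingDaemons(sample));
        // prints: Missing daemons: [NodeManager]
    }
}
```

On a multi-node cluster, a single machine normally hosts only some of these daemons, so the expected list would have to be adjusted to the role of the node whose jps output is being checked.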

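Uploading a file through the Java API
Here 192.168.10.121 is the IP address of one of the cluster nodes; the URI must point at the node running the NameNode. Setting up the Eclipse development environment on Linux is recommended.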

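import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Before;
import org.junit.Test;

public class HdfsClient {

    private FileSystem fs;

    @Before
    public void init() throws Exception {
        Configuration conf = new Configuration();
        // Replication factor for files written by this client
        conf.set("dfs.replication", "2");
        // Connect to the NameNode as user "jeang"
        fs = FileSystem.get(new URI("hdfs://192.168.10.121:9000"), conf, "jeang");
    }

    @Test
    public void testAddHdfsFile() throws Exception {
        Path src = new Path("E:/testHdfs.txt");      // local source file
        Path dst = new Path("/wordcount/data.txt");  // destination in HDFS
        fs.copyFromLocalFile(src, dst);
        fs.close();  // release the connection
    }

}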

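Run hadoop fs -ls /wordcount to check whether the upload succeeded; data.txt should appear in the listing.
[Figure: hadoop fs -ls /wordcount output after a successful upload]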

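MapReduce program

The program consists of the Map and Reduce classes with their method implementations, plus a main class that submits the job to YARN.

Map class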

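import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    /**
     * Called once per input line: key is the byte offset of the line, value is its text.
     */
    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // Convert the current line to a String
        String line = value.toString();
        // Split on runs of whitespace so that repeated spaces do not produce empty tokens
        String[] words = line.trim().split("\\s+");
        // Emit (word, 1) for every word on the line
        for (String word : words) {
            if (word.isEmpty()) {
                continue;  // a blank line yields a single empty token
            }
            context.write(new Text(word), new IntWritable(1));
        }
    }
}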
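
The map step depends on how each line is cut into words. As a small illustration outside Hadoop, the following plain-Java sketch compares splitting on a single space with splitting on runs of whitespace. The class and method names (TokenizeDemo, splitOnSpace, splitOnWhitespace) are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizeDemo {

    /** Splits on every single space; consecutive spaces produce empty tokens. */
    static List<String> splitOnSpace(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.split(" ")) {
            tokens.add(token);
        }
        return tokens;
    }

    /** Splits on runs of whitespace and drops empty tokens. */
    static List<String> splitOnWhitespace(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "hello  hadoop hello";  // two spaces after the first word
        System.out.println(splitOnSpace(line));       // prints [hello, , hadoop, hello]
        System.out.println(splitOnWhitespace(line));  // prints [hello, hadoop, hello]
    }
}
```

In a mapper, an empty token like the one in the first result would be emitted as a word of its own and would show up in the final output with its own count.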

Reduce class

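import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * Called once per distinct key, with all the values the mappers emitted for that key.
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // Running total for this word
        int wordCount = 0;
        for (IntWritable value : values) {
            wordCount += value.get();
        }

        // Emit (word, total count)
        context.write(key, new IntWritable(wordCount));
    }
}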
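
To see how the Map and Reduce classes cooperate, it can help to mimic the data flow in plain Java without a cluster. The sketch below is only a single-process imitation of the map, shuffle and reduce phases, not how Hadoop actually executes a job. LocalWordCount and its method names are made up for this illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LocalWordCount {

    /** Map phase: emit a (word, 1) pair for every word of one line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group all values by key; the TreeMap keeps the keys sorted. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: called once per key with all of that key's values. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three phases over the given lines and returns word -> count. */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hello hadoop", "hello yarn hello")));
        // prints {hadoop=1, hello=3, yarn=1}
    }
}
```

The point of the imitation is the shape of the reduce input: by the time reduce runs, every (word, 1) pair for the same word has been gathered into one group, so the reducer only has to add them up.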

Job submission class for YARN

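import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class YarnRunner {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        /*conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.hostname", "server1");*/

        Job job = Job.getInstance(conf);
        // Lets Hadoop locate the jar that contains this class
        job.setJarByClass(YarnRunner.class);

        job.setMapperClass(WordCountMap.class);
        job.setReducerClass(WordCountReduce.class);

        // Key/value types emitted by the map phase
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Key/value types of the final (reduce) output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS; the output directory must not already exist
        FileInputFormat.setInputPaths(job, new Path("hdfs://server1:9000/wordcount/data.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://server1:9000/wordcount/result"));

        // Submit the job, print progress, and wait for it to finish
        boolean completion = job.waitForCompletion(true);
        System.exit(completion ? 0 : 1);

    }
}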

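Export the project as a jar, upload it to the cluster, and run it with hadoop jar, specifying the main class.

A common problem when running the jar is a "Connection refused" error. This is usually caused by the Linux environment configuration: edit /etc/hosts so that each IP address is mapped to its host name, as in the example below, where server1 to server4 are the host names of the cluster nodes.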
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.10.121 server1
192.168.10.122 server2
192.168.10.123 server3
192.168.10.124 server4
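
Whether every node name is mapped can be verified mechanically. The following sketch parses hosts-file text into a map from host name to IP address in plain Java. HostsCheck, parse and the sample text are assumptions for illustration, and the code only reads a string rather than the real /etc/hosts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HostsCheck {

    /** Parses hosts-file text: each line is "<ip> <name> [alias...]"; '#' starts a comment. */
    static Map<String, String> parse(String hostsText) {
        Map<String, String> nameToIp = new LinkedHashMap<>();
        for (String rawLine : hostsText.split("\\R")) {
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\\s+");
            for (int i = 1; i < fields.length; i++) {
                // Keep the first mapping seen for each name
                nameToIp.putIfAbsent(fields[i], fields[0]);
            }
        }
        return nameToIp;
    }

    public static void main(String[] args) {
        // Shortened sample; server3 is deliberately left out
        String hosts = "127.0.0.1 localhost\n"
                + "192.168.10.121 server1\n"
                + "192.168.10.122 server2\n";
        Map<String, String> map = parse(hosts);
        for (String node : new String[] {"server1", "server2", "server3"}) {
            System.out.println(node + " -> " + map.getOrDefault(node, "NOT MAPPED"));
        }
    }
}
```

A node reported as NOT MAPPED is a likely candidate for the connection error described above, since the other machines cannot resolve its name.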
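View the results (with the default single reducer, the output is written to /wordcount/result/part-r-00000)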
