Hadoop學習(4)-mapreduce的一些注意事項

關於mapreduce的一些注意細節

如果把mapreduce程序打包放到了liux下去運行，

命令java –cp xxx.jar 主類名

如果報錯了，說明是缺少相關的依賴jar包

用命令hadoop jar xxx.jar 類名因爲在集羣機器上用 hadoop jar xx.jar mr.wc.JobSubmitter 命令來啓動客戶端main方法時，hadoop jar這個命令會將所在機器上的hadoop安裝目錄中的jar包和配置文件加入到運行時的classpath中

那麼，我們的客戶端main方法中的new Configuration()語句就會加載classpath中的配置文件，自然就有了

fs.defaultFS 和 mapreduce.framework.name 和 yarn.resourcemanager.hostname 這些參數配置

會把本地hadoop的相關的所有jar包都會引用

Mapreduce也有本地的job運行，就是可以不用提交到yarn上，可以以單機的模式跑一邊以多個線程模擬也可以。

就是如果不管在Linux下還是windows下，提交job都會默認的提交到本地去運行，

如果在linux默認提交到yarn上運行，需要寫配置文件hadoop/etc/mapred-site.xml文件

<name>mapreduce.framework.name</name>

</property>

</configuration>

Key,value對，如果是自己的類的話，那麼這個類要實現Writable，同時要把你想序列化的數據轉化成二進制，然後放到重寫方法wirte參數的DataOutput裏面，另一個readFields重寫方法是用來反序列化用的，

注意反序列化的時候，會先拿這個類的無參構造方法構造出一個對象出來，然後再通過readFields方法來複原這個對象。

DataOutput也是一種流，只不過是hadoop的在封裝，自己用的時候，裏面需要加個FileOutputStream對象

DataOutput寫字符串的時候要用writeUTF(“字符串”),他這樣編碼的時候，會在字符串的前面先加上字符串的長度，這是考慮到字符編碼對其的問題，hadoop解析的時候就會先讀前面兩個字節，看一看這個字符串有多長，不然如果用write(字符串.getBytes())這樣他不知道這個字符串到底有多少個字節。

在reduce階段，如果把一個對象寫到hdfs裏面，那麼會調用字符串的toString方法，你可以重寫這個類的toString方法

舉例，下面這個類就可以在hadoop裏序列化

package mapreduce2;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys.Write;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.Waitable;

public class FlowBean implements Writable {

    private int up;//上行流量
    private int down;//下行流量
    private int sum;//總流量
    private String phone;//電話號
    
    public FlowBean(int up, int down, String phone) {
        this.up = up;
        this.down = down;
        this.sum = up + down;
        this.phone = phone;
    }
    public int getUp() {
        return up;
    }
    public void setUp(int up) {
        this.up = up;
    }
    public int getDown() {
        return down;
    }
    public void setDown(int down) {
        this.down = down;
    }
    public int getSum() {
        return sum;
    }
    public void setSum(int sum) {
        this.sum = sum;
    }
    public String getPhone() {
        return phone;
    }
    public void setPhone(String phone) {
        this.phone = phone;
    }
    @Override
    public void readFields(DataInput di) throws IOException {
        //注意這裏讀的順序要和寫的順序是一樣的
        this.up = di.readInt();
        this.down = di.readInt();
        this.sum = this.up + this.down;
        this.phone = di.readUTF();
    }
    @Override
    public void write(DataOutput Do) throws IOException {
        Do.writeInt(this.up);
        Do.writeInt(this.down);
        Do.writeInt(this.sum);
        Do.writeUTF(this.phone);
    }
    @Override
    public String toString() {
        return "電話號"+this.phone+" 總流量"+this.sum;
    }
}

當所有的reduceTask都運行完之後，還會調用一個cleanup方法

應用練習：統計一個頁面訪問總量爲n條的數據

方案一：只用一個reducetask，利用cleanup方法，在reducetask階段，先不直接放到hdfs裏面，而是存到一個Treemap裏面

再在reducetask結束後，在cleanup裏面通過把Treemap裏面前五輸出到HDFS裏面；

package cn.edu360.mr.page.topn;

public class PageCount implements Comparable<PageCount>{
    
    private String page;
    private int count;
    
    public void set(String page, int count) {
        this.page = page;
        this.count = count;
    }
    
    public String getPage() {
        return page;
    }
    public void setPage(String page) {
        this.page = page;
    }
    public int getCount() {
        return count;
    }
    public void setCount(int count) {
        this.count = count;
    }

    @Override
    public int compareTo(PageCount o) {
        return o.getCount()-this.count==0?this.page.compareTo(o.getPage()):o.getCount()-this.count;
    }
    
    

}

map類

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PageTopnMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
    
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] split = line.split(" ");
        context.write(new Text(split[1]), new IntWritable(1));
    }

}

reduce類

package cn.edu360.mr.page.topn;

import java.io.IOException;
import java.util.Map.Entry;
import java.util.Set;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PageTopnReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
    
    TreeMap<PageCount, Object> treeMap = new TreeMap<>();
    
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        PageCount pageCount = new PageCount();
        pageCount.set(key.toString(), count);
        
        treeMap.put(pageCount,null);
        
    }
    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
　　　　//可以在cleanup裏面拿到configuration，從裏面讀取要拿前幾條數據
        int topn = conf.getInt("top.n", 5);
        
        
        Set<Entry<PageCount, Object>> entrySet = treeMap.entrySet();
        int i= 0;
        
        for (Entry<PageCount, Object> entry : entrySet) {
            context.write(new Text(entry.getKey().getPage()), new IntWritable(entry.getKey().getCount()));
            i++;
            if(i==topn) return;
        }   
    }
}

然後jobSubmit類，注意這個要設定Configuration，這裏面有幾種方法

第一種是加載配置文件

        Configuration conf = new Configuration();
        conf.addResource("xx-oo.xml");

然後再在xx-oo.xml文件裏面寫

<configuration>
    <property>
        <name>top.n</name>
        <value>6</value>
    </property>
</configuration>

第二種方式

　　　　//通過直接設定
        conf.setInt("top.n", 3);
        //通過對java主程序 直接傳進來的參數
        conf.setInt("top.n", Integer.parseInt(args[0]));

第三種方式通過獲取配置文件參數

　　　　 Properties props = new Properties();
        props.load(JobSubmitter.class.getClassLoader().getResourceAsStream("topn.properties"));
        conf.setInt("top.n", Integer.parseInt(props.getProperty("top.n")));

然後再在topn.properties裏面配置參數

top.n=5

subsubmit類，默認在本機模擬運行

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobSubmitter {

    public static void main(String[] args) throws Exception {

        /**
         * 通過加載classpath下的*-site.xml文件解析參數
         */
        Configuration conf = new Configuration();
        conf.addResource("xx-oo.xml");
        
        /**
         * 通過代碼設置參數
         */
        //conf.setInt("top.n", 3);
        //conf.setInt("top.n", Integer.parseInt(args[0]));
        
        /**
         * 通過屬性配置文件獲取參數
         */
        /*Properties props = new Properties();
        props.load(JobSubmitter.class.getClassLoader().getResourceAsStream("topn.properties"));
        conf.setInt("top.n", Integer.parseInt(props.getProperty("top.n")));*/
        
        Job job = Job.getInstance(conf);

        job.setJarByClass(JobSubmitter.class);

        job.setMapperClass(PageTopnMapper.class);
        job.setReducerClass(PageTopnReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path("F:\\mrdata\\url\\input"));
        FileOutputFormat.setOutputPath(job, new Path("F:\\mrdata\\url\\output"));

        job.waitForCompletion(true);

    }
}

有時一個任務一個mapreduce是完成不了的，有可能會拆分成兩個或多個mapreduce

map階段會有自己的排序機制，比如一組數據(a,1),(b,1),(a,1),(c,1),他會先處理key爲1的一組數據，

這個排序機制我們也可以自己去實現，要對這個類實現Comparable接口，然後重寫compareTo方法。

但要注意這個排序機制只是對於一個reducetask來說的，如果有多個的話，只會得到局部排序。

如果要多個reducetask的話，我們就需要控制數據的分發規則，這樣雖然是會生成多個排序後的文件，但這些文件整體上依然是有序的。因爲我們控制了每一個reducetask處理數據的範圍。

額外java知識點補充

Treemap，放進去的東西會自動排序

兩種Treemap的自定義方法，第一種是傳入一個Comparator

public class TreeMapTest {
    
    public static void main(String[] args) {
        
        TreeMap<FlowBean, String> tm1 = new TreeMap<>(new Comparator<FlowBean>() {
            @Override
            public int compare(FlowBean o1, FlowBean o2) {
                //如果兩個類總流量相同的會比較電話號
                if( o2.getAmountFlow()-o1.getAmountFlow()==0){
                    return o1.getPhone().compareTo(o2.getPhone());
                }
                //如果流量不同，就按從小到大的順序排序
                return o2.getAmountFlow()-o1.getAmountFlow();
            }
        });
        FlowBean b1 = new FlowBean("1367788", 500, 300);
        FlowBean b2 = new FlowBean("1367766", 400, 200);
        FlowBean b3 = new FlowBean("1367755", 600, 400);
        FlowBean b4 = new FlowBean("1367744", 300, 500);
        
        tm1.put(b1, null);
        tm1.put(b2, null);
        tm1.put(b3, null);
        tm1.put(b4, null);
        //treeset的遍歷
        Set<Entry<FlowBean,String>> entrySet = tm1.entrySet();
        for (Entry<FlowBean,String> entry : entrySet) {
            System.out.println(entry.getKey() +"\t"+ entry.getValue());
        }
    }

}

第二種是在這個類中，實現一個Comparable接口

package cn.edu360.mr.page.topn;

public class PageCount implements Comparable<PageCount>{
    
    private String page;
    private int count;
    
    public void set(String page, int count) {
        this.page = page;
        this.count = count;
    }
    
    public String getPage() {
        return page;
    }
    public void setPage(String page) {
        this.page = page;
    }
    public int getCount() {
        return count;
    }
    public void setCount(int count) {
        this.count = count;
    }

    @Override
    public int compareTo(PageCount o) {
        return o.getCount()-this.count==0?this.page.compareTo(o.getPage()):o.getCount()-this.count;
    }
    
    

}

Hadoop學習(4)-mapreduce的一些注意事項

學習筆記（每天持續更新）

2021-致新的一年的自己

springboot一些註解

SpringCloud第七天

SpringCloud第六天

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結