1. Join Applications

1.1 Reduce Join

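How Reduce Join works:

Map side: tag each key/value pair with the table (file) it came from, so that records from different sources can be told apart. Emit the join field as the key, and the remaining fields plus the source tag as the value.
Reduce side: by the time reduce() runs, records are already grouped by the join field. Within each group, separate the records by their source tag (set in the Map phase) and then merge them.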
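
To make the tag-and-merge idea concrete, here is a minimal plain-Java sketch that runs without Hadoop. Everything in it is an illustrative assumption rather than a Hadoop API: the class ReduceJoinSketch, its Tagged helper, and the in-memory TreeMap that stands in for the shuffle. The rows are shaped like id/pid/amount and pid/pname, and the product names are ASCII stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of a reduce-side join, not Hadoop code.
final class ReduceJoinSketch {

    // One shuffled value: the source tag plus the non-key fields of the record.
    static final class Tagged {
        final String source;  // "order" or "pd"
        final String id;
        final int amount;
        final String pname;

        Tagged(String source, String id, int amount, String pname) {
            this.source = source;
            this.id = id;
            this.amount = amount;
            this.pname = pname;
        }
    }

    // "Map" phase: tag every record and file it under its join key (pid).
    // The TreeMap plays the role of the shuffle, which groups and sorts by key.
    static Map<String, List<Tagged>> mapPhase(List<String> orders, List<String> products) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        for (String line : orders) {
            String[] f = line.split("\t");   // id, pid, amount
            groups.computeIfAbsent(f[1], k -> new ArrayList<>())
                  .add(new Tagged("order", f[0], Integer.parseInt(f[2]), ""));
        }
        for (String line : products) {
            String[] f = line.split("\t");   // pid, pname
            groups.computeIfAbsent(f[0], k -> new ArrayList<>())
                  .add(new Tagged("pd", "", 0, f[1]));
        }
        return groups;
    }

    // "Reduce" phase: within each group, find the product name, then emit the orders.
    static List<String> reducePhase(Map<String, List<Tagged>> groups) {
        List<String> out = new ArrayList<>();
        for (List<Tagged> group : groups.values()) {
            String pname = "";
            for (Tagged t : group) {
                if (t.source.equals("pd")) {
                    pname = t.pname;
                }
            }
            for (Tagged t : group) {
                if (t.source.equals("order")) {
                    out.add(t.id + "\t" + pname + "\t" + t.amount);
                }
            }
        }
        return out;
    }

    static List<String> join(List<String> orders, List<String> products) {
        return reducePhase(mapPhase(orders, products));
    }

    public static void main(String[] args) {
        List<String> orders = Arrays.asList("1001\t01\t1", "1002\t02\t2", "1003\t03\t3", "1004\t01\t4");
        List<String> products = Arrays.asList("01\tXiaomi", "02\tHuawei", "03\tLenovo");
        join(orders, products).forEach(System.out::println);
    }
}
```

In a real job the grouping is done by the framework's shuffle, and each group is streamed to reduce() rather than held in a list.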

Example:

Requirement: merge the product name from the product table into the order table, joining on the product ID (pid).
order.txt:

id	pid	amount
1001	01	1
1002	02	2
1003	03	3
1004	01	4

pd.txt:

pid	pname
01	小米
02	華爲
03	聯想

Expected output:

id	pname	amount
1001	小米	1
1004	小米	4
1002	華爲	2
1003	聯想	3

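Implementation:

The OrderBean entity:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class OrderBean implements WritableComparable<OrderBean> {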

    private String id;
    private String pid;
    private int amount;
    private String pname;

    public OrderBean() {
    }


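    // Sort by pid ascending, then by pname descending. Order records carry an empty
    // pname, so within each pid the pd.txt record (non-empty pname) sorts first.
    @Override
    public int compareTo(OrderBean o) {
        int compare = this.pid.compareTo(o.pid);
        if (compare == 0) {
            return o.pname.compareTo(this.pname);
        } else {
            return compare;
        }
    }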

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeUTF(pid);
        out.writeInt(amount);
        out.writeUTF(pname);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.id = in.readUTF();
        this.pid = in.readUTF();
        this.amount = in.readInt();
        this.pname = in.readUTF();
    }
    // getters, setters, and toString() omitted
    ...
}

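Mapper class:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class OrderMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {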
    private OrderBean orderBean = new OrderBean();
    private String fileName;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSplit fs = (FileSplit) context.getInputSplit();
        fileName = fs.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        // Fill the OrderBean according to which file the record came from
        if ("order.txt".equals(fileName)){
            orderBean.setId(fields[0]);
            orderBean.setPid(fields[1]);
            orderBean.setAmount(Integer.parseInt(fields[2]));
            orderBean.setPname("");
        }else {
            orderBean.setPid(fields[0]);
            orderBean.setPname(fields[1]);
            orderBean.setId("");
            orderBean.setAmount(0);
        }
        context.write(orderBean,NullWritable.get());
    }
}

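Reducer class:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class OrderReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {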
    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        Iterator<NullWritable> vars = values.iterator();
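        // Advance to the first record of the group. Given OrderBean's sort order,
        // this is the pd.txt record, the only one with a non-empty pname.
        vars.next();
        String pname = key.getPname();
        while (vars.hasNext()) {
            // Hadoop reuses the key object: advancing the iterator refills key
            // with the fields of the next record in the group.
            vars.next();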
            key.setPname(pname);
            context.write(key, NullWritable.get());
        }
    }
}

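Grouping comparator class:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class OrderComparator extends WritableComparator {
    public OrderComparator() {
        // true: let WritableComparator create OrderBean instances to deserialize into
        super(OrderBean.class, true);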
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        OrderBean oa = (OrderBean) a;
        OrderBean ob = (OrderBean) b;
        return oa.getPid().compareTo(ob.getPid());
    }
}
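
The three classes above cooperate: OrderBean's ordering puts the product record first within each pid, the grouping comparator merges all keys with the same pid into one reduce() call, and the reducer copies the first record's pname onto the rest. A minimal plain-Java sketch of that interplay, with no Hadoop dependency, might look like the following. SecondarySortJoinSketch and its Key class are hypothetical stand-ins for the job and OrderBean, and the product names are ASCII substitutes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: imitates sort + grouping + reduce in memory, not Hadoop code.
final class SecondarySortJoinSketch {

    // Hypothetical stand-in for OrderBean.
    static final class Key {
        final String id;
        final String pid;
        final int amount;
        final String pname;

        Key(String id, String pid, int amount, String pname) {
            this.id = id;
            this.pid = pid;
            this.amount = amount;
            this.pname = pname;
        }
    }

    // Same ordering as OrderBean.compareTo: pid ascending, then pname descending.
    static final Comparator<Key> SORT = Comparator
            .comparing((Key k) -> k.pid)
            .thenComparing((Key k) -> k.pname, Comparator.reverseOrder());

    static List<String> join(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);                       // the shuffle sort

        List<String> out = new ArrayList<>();
        String currentPid = null;
        String pname = "";
        for (Key k : sorted) {
            if (!k.pid.equals(currentPid)) {
                // First record of a new pid group (grouping by pid only):
                // by the sort order this is the product record, so keep its name.
                currentPid = k.pid;
                pname = k.pname;
                continue;
            }
            // Remaining records in the group are orders; attach the product name.
            out.add(k.id + "\t" + pname + "\t" + k.amount);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
                new Key("1001", "01", 1, ""), new Key("1002", "02", 2, ""),
                new Key("1003", "03", 3, ""), new Key("1004", "01", 4, ""),
                new Key("", "01", 0, "Xiaomi"), new Key("", "02", 0, "Huawei"),
                new Key("", "03", 0, "Lenovo"));
        join(keys).forEach(System.out::println);
    }
}
```

Like the reducer above, this sketch always treats the first record of a group as the product record, so an order whose pid has no matching product row would be silently dropped.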

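Driver class:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderDriver {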
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(OrderDriver.class);
        job.setMapperClass(OrderMapper.class);
        job.setReducerClass(OrderReducer.class);

        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);
        // Group keys by pid only, so each reduce() call sees one product and its orders
        job.setGroupingComparatorClass(OrderComparator.class);

        FileInputFormat.setInputPaths(job, new Path("D:\\MyFile\\test"));
        // Output directory (the _SUCCESS marker is written here); it must not exist yet
        FileOutputFormat.setOutputPath(job, new Path("d:\\output"));
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

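Drawback:
In a Reduce Join the merge happens in the Reduce phase, so the Reduce side carries most of the processing load while the Map nodes do very little. Resources are used poorly, and the Reduce phase is prone to data skew.

Solution: use a Map Join.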

1.2 Map Join

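When to use it:

Map Join fits the case where one table is very small and the other is very large.

The small table (or tables) is cached on the Map side and the join logic runs there. This moves work into the Map phase, reduces the volume of data reaching the Reduce side, and keeps data skew to a minimum.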

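How to implement it:

In the Driver, add the small file to the distributed cache (job.addCacheFile()).
In the Mapper's setup() method, read the cached file into memory.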
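
The core of a map-side join is a hash lookup: load the small table into a map once, then enrich each record of the large table as it streams past. The following plain-Java sketch shows that idea in isolation; MapJoinSketch, loadSmallTable, and joinOne are hypothetical names chosen for illustration, and the product names are ASCII stand-ins.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the lookup logic of a map-side join, without Hadoop.
final class MapJoinSketch {

    // Corresponds to setup(): read the small table (pid \t pname) into memory once.
    static Map<String, String> loadSmallTable(List<String> pdLines) {
        Map<String, String> pMap = new HashMap<>();
        for (String line : pdLines) {
            String[] f = line.split("\t");
            pMap.put(f[0], f[1]);
        }
        return pMap;
    }

    // Corresponds to map(): enrich one order line (id \t pid \t amount) with pname.
    // An unknown pid yields an empty product name rather than an error.
    static String joinOne(String orderLine, Map<String, String> pMap) {
        String[] f = orderLine.split("\t");
        String pname = pMap.getOrDefault(f[1], "");
        return f[0] + "\t" + pname + "\t" + f[2];
    }

    public static void main(String[] args) {
        Map<String, String> pMap = loadSmallTable(
                Arrays.asList("01\tXiaomi", "02\tHuawei", "03\tLenovo"));
        for (String order : Arrays.asList("1001\t01\t1", "1002\t02\t2", "1005\t09\t7")) {
            System.out.println(joinOne(order, pMap));
        }
    }
}
```

Because each record is joined independently, no shuffle or Reduce phase is needed, which is why this approach only works when the small table fits in each Map task's memory.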

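Code:

Mapper class:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Assumption: StringUtils is the Apache Commons Lang 3 class; the original does not say.
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OrderMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {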
    private OrderBean orderBean = new OrderBean();
    private Map<String, String> pMap = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI[] cacheFiles = context.getCacheFiles();
        String path = cacheFiles[0].getPath();
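        // Reading with FSDataInputStream garbled the Chinese text, so read through
        // an InputStreamReader that decodes UTF-8 explicitly instead.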
//        FileSystem fs = FileSystem.get(context.getConfiguration());
//        FSDataInputStream fis = fs.open(new Path(path));
        BufferedReader fis = new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8"));
        String line;
        while (StringUtils.isNotEmpty(line = fis.readLine())) {
            String[] fields = line.split("\t");
            pMap.put(fields[0], fields[1]);
        }
        IOUtils.closeStream(fis);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        String pname = pMap.get(fields[1]);
        pname = pname == null ? "" : pname;
        orderBean.setId(fields[0]);
        orderBean.setPid(fields[1]);
        orderBean.setAmount(Integer.parseInt(fields[2]));
        orderBean.setPname(pname);
        context.write(orderBean, NullWritable.get());
    }
}

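Driver:

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderDriver {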
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(OrderDriver.class);
        job.setMapperClass(OrderMapper.class);

        FileInputFormat.setInputPaths(job, new Path("D:\\MyFile\\test"));
        // Output directory (the _SUCCESS marker is written here); it must not exist yet
        FileOutputFormat.setOutputPath(job, new Path("d:\\output"));

        // Register the small table as a cache file
        job.addCacheFile(new URI("file:///d:/MyFile/cache/pd.txt"));
        // A map-side join needs no Reduce phase, so set the number of ReduceTasks to 0
        job.setNumReduceTasks(0);


        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

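2. Counters

Hadoop maintains a number of built-in counters for every job, describing various metrics. For example, some counters record the number of bytes and records processed, which lets the user monitor how much input has been consumed and how much output has been produced.

Counting with an enum:

enum MyCounter { MALFORMED, NORMAL }
// Increment the custom counter defined by the enum by 1
context.getCounter(MyCounter.MALFORMED).increment(1);

Counting with a counter group name and a counter name:

context.getCounter("counterGroup", "counter").increment(1);

The counter totals are printed to the console after the job finishes.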

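3. Data Cleaning (ETL)

Before running the core business MapReduce job, the data usually has to be cleaned first, discarding records that do not meet the requirements. Cleaning typically needs only a Mapper; no Reducer is required.

Requirement: drop log lines that have 11 or fewer fields.
Mapper class: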

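import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable>{
	
	Text k = new Text();
	
	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		
		// 1 Read one line
		String line = value.toString();
		
		// 2 Parse the log line
		boolean result = parseLog(line,context);
		
		// 3 Skip the line if it is invalid
		if (!result) {
			return;
		}
		
		// 4 Set the output key
		k.set(line);
		
		// 5 Write the record
		context.write(k, NullWritable.get());
	}

	// Returns true when the line has more than 11 space-separated fields
	private boolean parseLog(String line, Context context) {

		// 1 Split into fields
		String[] fields = line.split(" ");
		
		// 2 A line with more than 11 fields is valid
		if (fields.length > 11) {

			// Custom counter: group "map", counter "true" (valid lines)
			context.getCounter("map", "true").increment(1);
			return true;
		}else {
			// Custom counter: group "map", counter "false" (discarded lines)
			context.getCounter("map", "false").increment(1);
			return false;
		}
	}
}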

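Driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogDriver {

	public static void main(String[] args) throws Exception {
		// 1 Get the job instance
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);

		// 2 Set the jar by its driver class
		job.setJarByClass(LogDriver.class);

		// 3 Set the Mapper
		job.setMapperClass(LogMapper.class);

		// 4 Set the final output types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);

		// Map-only job: set the number of ReduceTasks to 0
		job.setNumReduceTasks(0);

		// 5 Set the input and output paths
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		// 6 Submit the job and report success or failure through the exit code
		boolean result = job.waitForCompletion(true);
		System.exit(result ? 0 : 1);
	}
}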

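Counter output: after the job finishes, the console lists the counters "true" and "false" under the group "map".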
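
The filtering rule and its two counters can be checked without a cluster. The sketch below is a plain-Java approximation: LogFilterSketch is a hypothetical class, and its counters map is only a stand-in for Hadoop's counter mechanism, not the real API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the cleaning rule from LogMapper, with a map standing in for counters.
final class LogFilterSketch {

    // Stand-in for Hadoop counters: counter name -> count.
    final Map<String, Long> counters = new LinkedHashMap<>();

    // A line is valid when it has more than 11 space-separated fields.
    boolean isValid(String line) {
        boolean ok = line.split(" ").length > 11;
        counters.merge(ok ? "true" : "false", 1L, Long::sum);
        return ok;
    }

    // Keep only the valid lines, counting both outcomes along the way.
    List<String> clean(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (isValid(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        LogFilterSketch sketch = new LogFilterSketch();
        List<String> kept = sketch.clean(Arrays.asList(
                "a b c d e f g h i j k l",   // 12 fields: kept
                "a b c d e f g h i j k",     // 11 fields: dropped
                "a b c"));                   // 3 fields: dropped
        System.out.println(kept);
        System.out.println(sketch.counters);
    }
}
```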

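4. MapReduce Development Summary

Points to consider when writing a MapReduce program:

① Input interface: InputFormat

The default implementation is TextInputFormat.
TextInputFormat reads text one line at a time, returning the line's starting byte offset as the key and the line content as the value.
KeyValueTextInputFormat treats each line as one record and splits it into a key and a value at a separator. The default separator is tab (\t).
NLineInputFormat creates input splits of a specified number of lines N.
CombineTextInputFormat can merge many small files into a single split, which makes processing more efficient.
Users can also define a custom InputFormat.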
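
To illustrate what keys and values the first two formats produce, here is a small plain-Java sketch. It is not Hadoop code: InputFormatSketch and both method names are hypothetical, and the handling of a line with no separator (whole line as key, empty value) is an assumption about KeyValueTextInputFormat made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: mimics the key/value shapes of two input formats.
final class InputFormatSketch {

    // TextInputFormat-style keys: the byte offset at which each line starts.
    static long[] lineStartOffsets(String content) {
        String[] lines = content.split("\n");
        long[] offsets = new long[lines.length];
        long pos = 0;
        for (int i = 0; i < lines.length; i++) {
            offsets[i] = pos;
            // Advance by the line's byte length plus one byte for '\n'.
            pos += lines[i].getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    // KeyValueTextInputFormat-style split: text before the first separator is the key,
    // everything after it is the value.
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };   // assumed behavior when no separator is found
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(lineStartOffsets("ab\ncde\nf")));
        System.out.println(Arrays.toString(splitKeyValue("k1\tv1\tx", '\t')));
    }
}
```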

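② Processing interface: Mapper

Users implement up to three methods as the business logic requires: map(), setup(), and cleanup().

③ Partitioning: Partitioner

The default implementation is HashPartitioner, which derives a partition number from the key's hash code and the number of ReduceTasks: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks

If the business has special requirements, a custom Partitioner can be defined.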
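
The mask in that formula matters because hashCode() can be negative, and a negative remainder is not a valid partition number. A minimal sketch of the arithmetic follows; HashPartitionSketch and its method names are hypothetical, and only the expression itself mirrors the formula above.

```java
// Illustrative only: the arithmetic behind hash partitioning.
final class HashPartitionSketch {

    // Clear the sign bit first, then take the remainder: result is always in [0, n).
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Without the mask, a negative hash code gives a negative "partition".
    static int naivePartition(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer key = -7;   // Integer.hashCode() is the int value itself
        System.out.println(partition(key, 3));
        System.out.println(naivePartition(key, 3));
    }
}
```

The parentheses are required: in Java, % binds more tightly than &, so the unparenthesized expression would compute something different.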

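④ Sorting: Comparable

When a custom object is used as the output key, it must implement the WritableComparable interface and override compareTo().

Partial sort: each final output file is sorted internally.
Total sort: all of the data is sorted; usually this means a single ReduceTask.
Secondary sort: the ordering uses two criteria.
Auxiliary (grouping) sort: lets keys that differ be handled by the same reduce() call.

⑤ Combiner

A Combiner can make a job more efficient by reducing the data transferred between Map and Reduce. It must never change the final result of the business logic.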
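
Whether a Combiner is safe depends on the operation. Summation gives the same answer when partial results are combined first, whereas averaging does not. The sketch below demonstrates this with plain Java; CombinerSketch and its methods are hypothetical, and each inner array stands for the values seen by one Map task.

```java
// Illustrative only: why sum is combiner-safe and mean is not.
final class CombinerSketch {

    static int sum(int[] xs) {
        int total = 0;
        for (int x : xs) {
            total += x;
        }
        return total;
    }

    // Sum computed from per-task partial sums (what a sum Combiner would produce).
    static int sumOfPartialSums(int[][] splits) {
        int total = 0;
        for (int[] split : splits) {
            total += sum(split);
        }
        return total;
    }

    // Mean computed from per-task partial means (what a naive mean Combiner would produce).
    static double meanOfPartialMeans(int[][] splits) {
        double total = 0;
        for (int[] split : splits) {
            total += (double) sum(split) / split.length;
        }
        return total / splits.length;
    }

    // Mean over all values, computed without any combining step.
    static double trueMean(int[][] splits) {
        int total = 0;
        int count = 0;
        for (int[] split : splits) {
            total += sum(split);
            count += split.length;
        }
        return (double) total / count;
    }

    public static void main(String[] args) {
        int[][] splits = { {3, 5, 7}, {2, 6} };
        System.out.println(sumOfPartialSums(splits));
        System.out.println(meanOfPartialMeans(splits));
        System.out.println(trueMean(splits));
    }
}
```

For these two splits the combined sum is still 23, but the mean of the partial means is 4.5 while the true mean is 4.6. A common workaround is to have the Combiner emit partial sums and counts and compute the mean only in the Reducer.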

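⑥ Reduce-side grouping: GroupingComparator

Groups keys on the Reduce side. Use it when the key is a bean and keys that share one or more fields (but are not equal on all fields) should enter the same reduce() call.

⑦ Processing interface: Reducer

Users implement up to three methods as the business logic requires: reduce(), setup(), and cleanup().

⑧ Output interface: OutputFormat

The default implementation is TextOutputFormat, which writes each key/value pair as one line of the target text file.

Users can also define a custom OutputFormat.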

