A first look at Hadoop: HDFS distributed storage + MapReduce (MR) distributed computation

Original

2018-10-21 07:12

How HDFS differs from an RDBMS
How MR compares with grid computing and volunteer computing

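1. Data storage

	Disk storage	Distributed problems addressed	Hardware requirements	System bottleneck
HDFS	Disk arrays across a cluster	Hardware failure; data correctness across multiple data sources	Commodity machines	Data transfer: disk bandwidth
RDBMS	Single disk		Dedicated servers	Disk seeks: large numbers of updates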
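The "data transfer: disk bandwidth" bottleneck can be made concrete with a little arithmetic. The sketch below is plain Java; the class name, the 1 TB dataset size and the 100 MB/s per-disk rate are illustrative assumptions, not figures from this post. It shows why spreading the same data over many disks that are read in parallel shortens a full scan.

```java
public class ScanTime {
    // Seconds needed to stream `bytes` when `disks` drives each deliver
    // `bytesPerSecond` and are read in parallel.
    static double scanSeconds(double bytes, double bytesPerSecond, int disks) {
        return bytes / (bytesPerSecond * disks);
    }

    public static void main(String[] args) {
        double oneTerabyte = 1e12; // assumed dataset size
        double perDisk = 100e6;    // assumed sustained rate: 100 MB/s per disk

        System.out.println(scanSeconds(oneTerabyte, perDisk, 1));   // prints 10000.0
        System.out.println(scanSeconds(oneTerabyte, perDisk, 100)); // prints 100.0
    }
}
```

Under these assumptions one disk needs close to three hours for a full scan, while 100 disks finish in under two minutes, which is the motivation for storing the data across a cluster.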

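2. Analysis and computation

	Use case	Characteristics	Ecosystem	Schema	Data integrity	Scalability	Dataset structure
MR	PB-scale data: batch processing	Write once, read many	YARN integrates other distributed programs, e.g. Hive and Spark	Schema-on-read	Low	High	Semi-structured or unstructured
RDBMS	GB-scale data: real-time queries and updates	Continuously updated		Schema-on-write	High	Low	Structured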
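The schema-on-read versus schema-on-write row can be illustrated with a small plain-Java sketch (no Hadoop or database involved). All class and method names here are hypothetical, and the "year,temperature" record layout is borrowed from the example in section 4.2. With schema-on-write a record is validated before it is stored; with schema-on-read the raw lines are stored untouched and interpreted only when they are read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SchemaDemo {
    // Apply the "year,temperature" schema to one line.
    static int[] parse(String line) {
        String[] parts = line.split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected year,temperature but got: " + line);
        }
        // Integer.parseInt throws NumberFormatException (an IllegalArgumentException) on bad numbers
        return new int[] { Integer.parseInt(parts[0].trim()), Integer.parseInt(parts[1].trim()) };
    }

    // Schema-on-write: every row is checked before it is stored, so one bad row aborts the load.
    static List<int[]> storeWithSchemaOnWrite(List<String> lines) {
        List<int[]> table = new ArrayList<>();
        for (String line : lines) {
            table.add(parse(line));
        }
        return table;
    }

    // Schema-on-read: loading is just a copy of the raw lines.
    static List<String> storeRaw(List<String> lines) {
        return new ArrayList<>(lines);
    }

    // The schema is applied only at read time; in this sketch unparsable lines are skipped.
    static List<int[]> readWithSchema(List<String> raw) {
        List<int[]> rows = new ArrayList<>();
        for (String line : raw) {
            try {
                rows.add(parse(line));
            } catch (IllegalArgumentException e) {
                // malformed line: ignore it for this read
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1982,-8", "not a record", "1931,-4");
        System.out.println(readWithSchema(storeRaw(lines)).size()); // prints 2
        try {
            storeWithSchemaOnWrite(lines);
        } catch (IllegalArgumentException e) {
            System.out.println("load rejected"); // the second line fails validation
        }
    }
}
```

The trade-off matches the table: loading raw data is cheap and flexible, but integrity is only as good as the parsing done by each reader.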

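3. Grid computing and volunteer computing

	Characteristics	Use case
Grid computing	Computation on distributed nodes + a network-shared file system	Small data volumes: no network transfer bottleneck
Volunteer computing	Work split into units + distributed computation + result verification	CPU-intensive work: compute time > transfer time
MR	Moving computation to the data + data locality	Short job cycles (measured in hours), within a high-speed LAN, on high-spec hardware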

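4. MR compared with Linux awk stream processing

4.1 awk: maximum temperature per year

4.2 MapReduce: maximum temperature per year

a. MapReduce written in Ruby:

b. MapReduce written in Java ====> IDEA + Maven: add the dependency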

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
</dependency>

The map method

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class Map1 extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Pre-cleaned input, one "year,temperature" record per line:
        // 1982,-8
        // 1931,-4
        String[] arr = value.toString().split(",");
        int year, tmp;

        // Parse both fields; skip malformed records instead of emitting
        // a bogus (0, Integer.MIN_VALUE) pair
        try {
            year = Integer.parseInt(arr[0].trim());
            tmp = Integer.parseInt(arr[1].trim());
        } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
            return;
        }
        // Emit the new record: (year, temperature)
        context.write(new IntWritable(year), new IntWritable(tmp));
    }
}

The reduce method

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class Reduce1 extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Input: one year with all of its temperatures, e.g. 1931, [-4, 23, 4, 35, 6]
        // Aggregation: take max(tmp) over the group
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        // Emit (year, maximum temperature)
        context.write(key, new IntWritable(max));
    }
}
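The mapper and reducer logic can be tried without a cluster. The following is a minimal plain-Java sketch of the same flow (map each line to a year/temperature pair, group the pairs by year, reduce each group to its maximum). It does not use the Hadoop API; the class name, method names and sample lines are made up for illustration, and malformed lines are simply skipped in this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalMaxTemp {
    // "Map" step: turn one "year,temperature" line into {year, temperature},
    // or return null when the line cannot be parsed.
    static int[] map(String line) {
        String[] arr = line.split(",");
        try {
            return new int[] { Integer.parseInt(arr[0].trim()), Integer.parseInt(arr[1].trim()) };
        } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
            return null;
        }
    }

    // "Shuffle" step: group temperatures by year, with keys kept in sorted order.
    static Map<Integer, List<Integer>> shuffle(List<String> lines) {
        Map<Integer, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            int[] pair = map(line);
            if (pair != null) {
                groups.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            }
        }
        return groups;
    }

    // "Reduce" step: the maximum temperature of one group.
    static int reduce(List<Integer> temps) {
        int max = Integer.MIN_VALUE;
        for (int t : temps) {
            max = Math.max(max, t);
        }
        return max;
    }

    // Run all three steps and return year -> maximum temperature.
    static Map<Integer, Integer> run(List<String> lines) {
        Map<Integer, Integer> result = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> e : shuffle(lines).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1982,-8", "1931,-4", "1931,35", "1982,12", "oops");
        System.out.println(run(lines)); // prints {1931=35, 1982=12}
    }
}
```

In a real job the grouping step is done by the framework between the map and reduce phases, across machines rather than in one in-memory map.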

The App class: configures and submits the job

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class App1 {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(App1.class);
        job.setJobName("maxTmp");

        // Mapper and reducer classes
        job.setMapperClass(Map1.class);
        job.setReducerClass(Reduce1.class);

        // Key/value types of the map output and of the final output
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        // Map side: pre-aggregate with a combiner
        job.setCombinerClass(Reduce1.class);
        job.setNumReduceTasks(3);
        // Input and output paths
        FileInputFormat.addInputPath(job, new Path("/home/wang/txt/tmp.txt"));
        // Delete the output directory if it already exists; otherwise the job fails
        FileSystem fs = FileSystem.get(conf);
        Path outPath = new Path("/home/wang/tmp-out");
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        FileOutputFormat.setOutputPath(job, outPath);

        // Submit the job, wait for it to finish, and report the result through the exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
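Two settings in App1 are worth a quick illustration: reusing Reduce1 as the combiner, and running 3 reduce tasks. Reusing the reducer as a combiner is safe here because max is associative and commutative, so the maximum of the per-map-task maxima equals the maximum over all values. With several reduce tasks, each key is routed to exactly one of them; Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, and the hash code of an IntWritable is its int value. The plain-Java sketch below mimics both ideas without the Hadoop API; the class name, the split of the values into two "map tasks" and the sample years are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CombinerDemo {
    // Same aggregation as Reduce1: the maximum of a group of temperatures.
    static int max(List<Integer> temps) {
        return Collections.max(temps);
    }

    // Mirrors the default HashPartitioner formula for an int key
    // (Integer.hashCode(year) is the value itself, like IntWritable.hashCode()).
    static int partitionFor(int year, int numReduceTasks) {
        return (Integer.hashCode(year) & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Temperatures for one year, as seen by two hypothetical map tasks
        List<Integer> mapTask1 = Arrays.asList(-4, 23, 4);
        List<Integer> mapTask2 = Arrays.asList(35, 6);

        // With a combiner, each map task sends only its local maximum to the reducer
        int withCombiner = max(Arrays.asList(max(mapTask1), max(mapTask2)));
        // Without a combiner, the reducer sees every value
        int withoutCombiner = max(Arrays.asList(-4, 23, 4, 35, 6));
        System.out.println(withCombiner == withoutCombiner); // prints true

        // Which of the 3 reduce tasks handles each year
        System.out.println(partitionFor(1931, 3)); // prints 2
        System.out.println(partitionFor(1982, 3)); // prints 2
        System.out.println(partitionFor(1983, 3)); // prints 0
    }
}
```

An aggregation such as an average could not reuse its reducer this way, because the average of partial averages is generally not the overall average. With 3 reduce tasks the output directory contains one file per task (part-r-00000 to part-r-00002), each holding the years assigned to that partition.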
