Notes on Hadoop: The Definitive Guide

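Combiner functions: the bandwidth available on the cluster limits MapReduce jobs, so it pays to minimize the data transferred between map and reduce tasks.
A combiner can be specified for the map output; it reduces the volume of data before the result is passed to reduce.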
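
To make the effect concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It mimics what a combiner would do for a job that finds the maximum value per key. The class name CombinerSketch, the method combine, and the sample records are illustrative assumptions, not Hadoop API. The sketch relies on max being commutative and associative, so applying it early on each map task's output does not change the final reduce result.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a combiner; not Hadoop API.
class CombinerSketch {
    // Collapses one map task's (key, value) records to a single maximum per key.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> record : mapOutput) {
            combined.merge(record.getKey(), record.getValue(), Math::max);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Assumed sample output of a single map task.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(Map.entry("1950", 0));
        mapOutput.add(Map.entry("1950", 20));
        mapOutput.add(Map.entry("1950", 10));
        mapOutput.add(Map.entry("1951", 25));
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println(mapOutput.size() + " records before, "
                + combined.size() + " records sent to reduce: " + combined);
    }
}
```

Functions such as mean cannot be used this way, because the mean of partial means is generally not the overall mean.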

core-site.xml
<property>
    <name>fs.defaultFS</name>   <!-- default file system: the namenode host and port -->
    <value>hdfs://master:9000</value>    
</property>
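Because there are only 2 slaves, dfs.replication is set to 2.
	dfs.namenode.replication.min is set to 1: a write succeeds as soon as 1 replica has been written, and the block is then
	    replicated asynchronously until the target of 2 replicas (dfs.replication) is reached.
	Replica placement: the first replica goes on the client's node (if the client is outside the cluster, a node is chosen at random); the 2nd and 3rd go on two different random nodes of one rack, which is a different rack from the first.
	Further replicas go on random nodes in the cluster, and the system tries to avoid putting too many on the same rack. Nodes that are nearly full or too busy are always avoided.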
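
The placement rule for the first three replicas can be sketched in plain Java as follows. This is a simplified illustration under stated assumptions: nodes are named "/rack/node", the class ReplicaPlacementSketch and its methods are hypothetical, and the check for full or busy nodes is left out. It is not HDFS's actual placement code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the default replica placement idea; not HDFS code.
class ReplicaPlacementSketch {
    // "/rack1/node3" -> "/rack1"
    static String rackOf(String node) {
        return node.substring(0, node.lastIndexOf('/'));
    }

    static List<String> choose(List<String> nodes, String writer, Random rnd) {
        List<String> chosen = new ArrayList<>();
        // 1st replica: the writer's own node if it is a cluster node, else a random node.
        String first = nodes.contains(writer) ? writer : nodes.get(rnd.nextInt(nodes.size()));
        chosen.add(first);
        // 2nd replica: a random node on a different rack from the first.
        List<String> offRack = new ArrayList<>();
        for (String n : nodes) {
            if (!rackOf(n).equals(rackOf(first))) offRack.add(n);
        }
        if (offRack.isEmpty()) return chosen;
        String second = offRack.get(rnd.nextInt(offRack.size()));
        chosen.add(second);
        // 3rd replica: a different node on the same rack as the second.
        List<String> sameRack = new ArrayList<>();
        for (String n : offRack) {
            if (rackOf(n).equals(rackOf(second)) && !n.equals(second)) sameRack.add(n);
        }
        if (!sameRack.isEmpty()) chosen.add(sameRack.get(rnd.nextInt(sameRack.size())));
        return chosen;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("/r1/n1", "/r1/n2", "/r2/n3", "/r2/n4");
        System.out.println(choose(nodes, "/r1/n1", new Random()));
    }
}
```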


hdfs-site.xml
<property>
    <name>dfs.replication</name>   <!-- number of replicas per block, set here to the number of datanodes -->
    <value>2</value>
</property>





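hdfs dfs -copyFromLocal /home/hadoop/input/1.txt hdfs://localhost/user/root/input
Equivalent to the command below, because core-site.xml already sets the default file system (hdfs://master:9000) and a relative path resolves to the user's HDFS home directory:
hdfs dfs -copyFromLocal /home/hadoop/input/1.txt input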

hdfs dfs -copyToLocal output /home/hadoop


hadoop fs -ls output
permissions  replication factor  ...



Reading data from the file system:

InputStream in = null;
try {
	in = new URL("hdfs://host/path").openStream();  // java.net.URL
	// process in
} finally {
	IOUtils.closeStream(in);
}

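import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {  // setURLStreamHandlerFactory can be called only once per JVM
	static {  // lets java.net.URL recognize the hdfs:// scheme
		URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
	}
	public static void main(String[] args) throws Exception {
		InputStream in = null;
		try {
			in = new URL(args[0]).openStream();  // java.net.URL
			IOUtils.copyBytes(in, System.out, 4096, false);  // 4096-byte buffer; false = do not close the streams
		} finally {
			IOUtils.closeStream(in);
		}
	}
}  // hadoop URLCat hdfs://localhost/user/root/input/1.txt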




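Obtaining a FileSystem instance (settings come from core-site.xml)
public static FileSystem get(Configuration conf) throws IOException;  // default file system; conf is loaded from the configuration files on the classpath
public static FileSystem get(URI uri, Configuration conf) throws IOException;  // chosen by the URI scheme; the default file system if the URI has no scheme
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException, InterruptedException;  // access as the given user

public static LocalFileSystem getLocal(Configuration conf) throws IOException;  // the local file system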

Call open() on a FileSystem instance to get an input stream
public FSDataInputStream open(Path f) throws IOException;   // default buffer size is 4 KB
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException;

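Displaying a file from the file system on standard output by using FileSystem directly
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
	public static void main(String[] args) throws Exception {
		String uri = args[0];
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(URI.create(uri), conf);
		InputStream in = null;
		try {
			in = fs.open(new Path(uri));  // open() returns an FSDataInputStream
			IOUtils.copyBytes(in, System.out, 4096, false);
		} finally {
			IOUtils.closeStream(in);
		}
	}
}  // hadoop FileSystemCat hdfs://localhost/user/root/input/1.txt
// in.seek(0) returns to the start of the stream: FSDataInputStream supports random access (declare in as FSDataInputStream to call it)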


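Writing data to the file system:
public FSDataOutputStream create(Path f) throws IOException;
// Overloads let you specify whether to overwrite, the replication factor, the buffer size, the block size, and the file permissions.
// create() creates any missing parent directories; call exists() first if the write should fail when the parent is missing.
Appending data
public FSDataOutputStream append(Path f) throws IOException;  // not supported by every file system, e.g. S3 does not support it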

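import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
	public static void main(String[] args) throws Exception {
		String localSrc = args[0];
		String dst = args[1];
		InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(URI.create(dst), conf);
		OutputStream out = fs.create(new Path(dst), new Progressable() {
			public void progress() {  // called back as data is written to the datanodes
				System.out.print("callback!");
			}
		});  // create() returns an FSDataOutputStream, which cannot seek: writes are sequential or appended at the end
		IOUtils.copyBytes(in, out, 4096, true);  // true = close both streams when done
	}
}  // hadoop FileCopyWithProgress input/docs/1.txt hdfs://localhost/user/root/input/1.txt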

Directories
public boolean mkdirs(Path f) throws IOException;  // also creates any necessary parent directories that do not exist

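Querying the file system
1. File metadata: FileStatus
	Encapsulates the metadata of files and directories: length, block size, replication, modification time, owner, and permissions.
	Obtained with getFileStatus(Path f);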

2. Listing files
public FileStatus[] listStatus(Path f);
public FileStatus[] listStatus(Path f,PathFilter filter);
public FileStatus[] listStatus(Path[] files);
public FileStatus[] listStatus(Path[] files,PathFilter filter);

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListStatus {
    // Recursively walks hdfsSrc and copies every file found into the local directory localDst.
    public static boolean downLoadHDFS(String hdfsSrc, String localDst) throws IOException {
        Configuration conf = new Configuration();
        Path srcPath = new Path(hdfsSrc);
        int count = 0;  // files found directly in this directory
        FileSystem fs = FileSystem.get(URI.create(hdfsSrc), conf);
        try {
            for (FileStatus f : fs.listStatus(srcPath)) {
                String subPath = f.getPath().toString();  // e.g. hdfs://54.0.88.53:8020/...
                if (f.isDirectory()) {
                    downLoadHDFS(subPath, localDst);
                } else {
                    System.out.println("\t\t" + subPath);
                    count++;
                    Path src = f.getPath();
                    // localDst is expected to end with a path separator
                    Path dst = new Path(localDst + src.getName());
                    FileSystem localFS = FileSystem.getLocal(conf);
                    // try-with-resources closes both streams even if the copy fails
                    try (FSDataInputStream in = fs.open(src);
                         FSDataOutputStream output = localFS.create(dst)) {
                        byte[] buf = new byte[1024];
                        int readbytes;
                        while ((readbytes = in.read(buf)) > 0) {
                            output.write(buf, 0, readbytes);
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                        System.out.print(" download failed.");
                    }
                }
            }
        } finally {
            System.out.println("the number of files is :" + count);
        }
        return true;
    }
}



3. File patterns
public class RegexExcludePathFilter implements PathFilter {  // excludes paths that match the regex
	private final String regex;
	public RegexExcludePathFilter(String regex) {
		this.regex = regex;
	}
	public boolean accept(Path path) {
		return !path.toString().matches(regex);
	}
}
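
The exclusion logic of the filter above can be tried without a cluster. The sketch below applies the same "accept unless the whole path matches the regex" test to plain strings; the class RegexExcludeSketch, its methods, and the sample paths are assumptions for illustration and use java.util.function.Predicate instead of Hadoop's PathFilter.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical plain-Java analogue of a regex-excluding path filter.
class RegexExcludeSketch {
    // Accepts a path only if it does NOT match the regex in full.
    static Predicate<String> exclude(String regex) {
        return path -> !path.matches(regex);
    }

    static List<String> filter(List<String> paths, String regex) {
        return paths.stream().filter(exclude(regex)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> paths = List.of("/2007/12/30", "/2007/12/31", "/2008/01/01");
        // Keeps every path except the one ending in /2007/12/31.
        System.out.println(filter(paths, "^.*/2007/12/31$"));
    }
}
```

Note that String.matches tests the entire string, so a pattern meant to match a path suffix needs a leading ".*".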

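Deleting data
public boolean delete(Path f, boolean recursive) throws IOException;
If f is a file or an empty directory, recursive is ignored. A non-empty directory and its contents are deleted only when recursive is true; otherwise an IOException is thrown.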
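
The rule for the recursive flag can be illustrated on the local file system with java.nio. The sketch below is only an analogue of the semantics described above, not Hadoop's implementation; the class DeleteSketch is hypothetical, and returning false for a missing path is an assumption of this sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-file-system analogue of delete(Path f, boolean recursive).
class DeleteSketch {
    static boolean delete(Path f, boolean recursive) throws IOException {
        if (!Files.exists(f)) {
            return false; // assumption of this sketch: nothing to delete
        }
        if (Files.isDirectory(f)) {
            boolean empty;
            try (Stream<Path> children = Files.list(f)) {
                empty = !children.findAny().isPresent();
            }
            // recursive only matters for a non-empty directory
            if (!empty && !recursive) {
                throw new IOException(f + " is not empty; pass recursive=true");
            }
        }
        List<Path> paths;
        try (Stream<Path> tree = Files.walk(f)) {
            // reverse order puts children before their parent directories
            paths = tree.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("delete-sketch");
        Files.createFile(dir.resolve("1.txt"));
        System.out.println(delete(dir, true));
    }
}
```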


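Data flow (reading): using the network topology, the datanodes that hold each block are sorted by their distance from the client. The client calls read() repeatedly on the stream, which transfers data from the datanode
to the client; when the end of the block is reached, the stream closes the connection to that datanode and then finds the best datanode for the next block.
There are 4 distance levels: same node, same rack, same data center (different rack), different data centers, with distances 0, 2, 4, 6.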
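
The distance between two nodes is the number of hops from each of them up to their closest common ancestor in the topology tree. A minimal sketch, assuming locations are written as "/datacenter/rack/node" (the class NetworkDistanceSketch and the sample names are illustrative, not Hadoop API):

```java
// Hypothetical sketch of tree distance between two network locations.
class NetworkDistanceSketch {
    // Locations look like "/d1/r1/n1": data center, rack, node.
    static int distance(String a, String b) {
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        int common = 0;
        // Count the leading path components the two locations share.
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        // Hops from each node up to the closest common ancestor.
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
    }
}
```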

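Coherency model: hflush() only guarantees that the data has reached the memory of the datanodes in the write pipeline, which keeps the load on Hadoop low; hsync() forces the data to disk.
  The file system guarantees that a newly created file is immediately visible in the namespace, but content written to it is not guaranteed to be visible right away, even after the stream's flush() has been called.
In short, the block currently being written is not visible to other readers.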
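
hsync() plays a role similar to the POSIX fsync call. As a point of comparison, the sketch below shows the local-file counterpart in plain Java: flush() hands the data to the operating system and getFD().sync() asks the OS to commit it to disk. The class LocalSyncSketch and the method writeAndSync are hypothetical names; this is a local analogy and does not exercise HDFS.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical local-file analogue of writing followed by hsync().
class LocalSyncSketch {
    // Writes content, forces it to disk, and returns the file size seen
    // while the stream is still open.
    static long writeAndSync(Path file, String content) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file.toFile())) {
            out.write(content.getBytes(StandardCharsets.UTF_8));
            out.flush();          // flush to the operating system
            out.getFD().sync();   // ask the OS to sync the data to disk
            return Files.size(file);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sync-sketch", ".txt");
        System.out.println(writeAndSync(tmp, "content"));
        Files.delete(tmp);
    }
}
```

Without hflush() or hsync(), a client or system failure can lose up to a block of data, so the choice is a trade-off between robustness and throughput.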


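Parallel copy: hadoop distcp file1 file2                   distcp is implemented as a MapReduce job that has only map tasks
 		hadoop distcp dir1 dir2    if dir2 does not exist it is created; several source paths can be given
                                  if dir2 already exists, dir1 is copied under it as dir2/dir1
-overwrite replaces existing files; -update copies only the files that have changed
To keep the cluster balanced, aim for about 20 maps per node (the default); the number of maps can be set with -m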




#YARN: cluster resource management system
ResourceManager
NodeManager

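Resource requests
	A request for several containers can state the compute resources (CPU, RAM) each container needs, as well as locality constraints for the containers.
  A container can be requested on a specific node, on a specific rack, or anywhere in the cluster; when a constraint cannot be met, it can be loosened (for example from a node to its rack).
  	A YARN application can make resource requests at any time while it is running, adjusting them dynamically.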
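
The loosening of a locality constraint can be pictured as a three-step fallback: the requested node, then any node on the requested rack, then any node. The sketch below is an illustration only; the class LocalityRelaxSketch, the method place, and the node and rack names are assumptions and do not correspond to YARN's API or scheduler code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of loosening a locality constraint; not the YARN API.
class LocalityRelaxSketch {
    // freeNodes maps each node that has spare capacity to its rack.
    // Returns the chosen node, or null when no node is free.
    static String place(String wantedNode, String wantedRack, Map<String, String> freeNodes) {
        if (freeNodes.containsKey(wantedNode)) {
            return wantedNode;                        // node-local
        }
        for (Map.Entry<String, String> e : freeNodes.entrySet()) {
            if (e.getValue().equals(wantedRack)) {
                return e.getKey();                    // rack-local
            }
        }
        // off-rack: fall back to any free node
        return freeNodes.isEmpty() ? null : freeNodes.keySet().iterator().next();
    }

    public static void main(String[] args) {
        Map<String, String> free = new LinkedHashMap<>();
        free.put("node2", "rack1");
        free.put("node5", "rack2");
        // node1 has no spare capacity, so another rack1 node is chosen.
        System.out.println(place("node1", "rack1", free));
    }
}
```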

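Application lifespan: anywhere from a few seconds to several months
Models
	1. The simplest: one application per user job.
	2. One application per workflow or per user session; more efficient than 1 because containers can be reused between jobs (Spark uses this model).
	3. A long-running application shared by multiple users, acting as a coordinator (its application master stays up).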

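Building YARN applications
	MapReduce1: jobtracker (job scheduling and task progress monitoring)     tasktracker (runs the tasks)   slot
	MapReduce2: resource manager, application master, timeline server        node manager                   container

MapReduce1: each tasktracker is configured with a fixed number of map slots and reduce slots, and a slot of one kind cannot be reused for the other.
MapReduce2: resources are managed as one pool and can be reused; MapReduce becomes just one kind of YARN application.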
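
A small calculation shows why fixed slots waste capacity. The sketch below compares how many tasks can run at once on one worker under the two models. It assumes, for simplicity, that every task fits in one equally sized container; the class and method names are hypothetical.

```java
// Hypothetical sketch comparing fixed slots with a shared pool of containers.
class SlotsVsContainersSketch {
    // Fixed slots: map tasks can only use map slots, reduce tasks only reduce slots.
    static int runningWithSlots(int mapTasks, int reduceTasks, int mapSlots, int reduceSlots) {
        return Math.min(mapTasks, mapSlots) + Math.min(reduceTasks, reduceSlots);
    }

    // Shared pool: any container can run either kind of task.
    static int runningWithContainers(int mapTasks, int reduceTasks, int containers) {
        return Math.min(mapTasks + reduceTasks, containers);
    }

    public static void main(String[] args) {
        // Map phase of a job: 4 map tasks waiting, no reduce tasks yet.
        System.out.println(runningWithSlots(4, 0, 2, 2));    // 2: the reduce slots sit idle
        System.out.println(runningWithContainers(4, 0, 4));  // 4: the whole pool is usable
    }
}
```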

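YARN has 3 schedulers: FIFO, Capacity, and Fair.
	FIFO: simple to understand and needs no configuration, but unsuitable for shared clusters because a large application uses all the cluster's resources.
	Capacity: ensures that long-running jobs still complete in good time while small jobs get results within a reasonable time; compared with FIFO, a large job
	      takes somewhat longer to run.
	Fair: offers the same benefit as the first line under Capacity without reserving a set amount of capacity; resources are balanced dynamically between the running jobs.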


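Capacity scheduler configuration
	Resources are divided among several queues, each with its own share of capacity; a maximum capacity can also be set per queue. A queue can dynamically claim
	more than its share when resources are free, but never more than its maximum capacity.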
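
The interplay between a queue's share and its maximum capacity can be sketched as a small calculation. All numbers are percentages of the cluster; the class QueueElasticitySketch, the method grant, and the example values are assumptions for illustration and do not reflect the scheduler's real code.

```java
// Hypothetical sketch of queue elasticity with a maximum-capacity cap.
class QueueElasticitySketch {
    // Returns how much of a request can be granted, in percent of the cluster.
    // usedPct: what the queue already holds; maxCapacityPct: the queue's hard cap;
    // clusterFreePct: capacity that no queue is using right now.
    static int grant(int usedPct, int requestedPct, int maxCapacityPct, int clusterFreePct) {
        int headroom = Math.max(0, maxCapacityPct - usedPct);
        return Math.min(requestedPct, Math.min(headroom, clusterFreePct));
    }

    public static void main(String[] args) {
        // A queue with a 40% share and a 75% maximum already uses its full share.
        // It asks for 50% more while 60% of the cluster is idle.
        System.out.println(grant(40, 50, 75, 60)); // 35: stops at the 75% maximum
    }
}
```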

capacity-scheduler.xml