SequenceFile and MapFile in MapReduce

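SequenceFile is a binary file format provided by the Hadoop API. It serializes <key, value> pairs directly into a file. A common use is merging small files: the file name becomes the key and the file content becomes the value, and all pairs are serialized into one large file. The format has the following advantages: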

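Compression support: compression can be configured per Record or per Block (Block-level compression generally performs better).

Data locality: because the file can be split, MapReduce tasks reading it usually run close to the data they process.

Fast (de)serialization: the lengths of each key and value are recorded in the file, so records can be read and written very quickly.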

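The drawbacks are that a merge step is required, the resulting file is large, and the merged file is inconvenient to inspect: the only way to look at an individual small file is to iterate over the records.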
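The idea of merging small files into length-prefixed <key, value> records can be sketched in plain Java, without Hadoop. This is only an illustration under stated assumptions: the class name `LengthPrefixedPack` and the layout (key length, key bytes, value length, value bytes) are hypothetical and much simpler than the real SequenceFile on-disk format, which also has a header, sync markers and optional compression.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: packs small "files" into one byte array as length-prefixed records. */
class LengthPrefixedPack {

    /** Writes each entry as: keyLength, keyBytes, valueLength, valueBytes. */
    static byte[] pack(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> entry : files.entrySet()) {
                byte[] key = entry.getKey().getBytes(StandardCharsets.UTF_8);
                out.writeInt(key.length);
                out.write(key);
                out.writeInt(entry.getValue().length);
                out.write(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the records back. Because lengths are stored, no delimiter scanning is needed. */
    static Map<String, byte[]> unpack(byte[] packed) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed))) {
            while (in.available() > 0) {
                byte[] key = new byte[in.readInt()];
                in.readFully(key);
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
                files.put(new String(key, StandardCharsets.UTF_8), value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }
}
```

The sketch also shows the drawback mentioned above: to find one small file inside the packed data, `unpack` has to walk the records from the start.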

SequenceFile storage structure

Reading and writing a SequenceFile

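// Imports needed by this snippet
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration configuration = new Configuration();
FileSystem fileSystem = FileSystem.get(new URI("hdfs://hadoop1:9000"), configuration);

// Write: create the file /sf on HDFS and append four <LongWritable, Text> records
SequenceFile.Writer writer = SequenceFile.createWriter(fileSystem, configuration, new Path("/sf"), LongWritable.class, Text.class);
for (int i = 0; i < 4; i++) {
    writer.append(new LongWritable(i), new Text(i + "xxxx"));
}
IOUtils.closeStream(writer);

// Read: next() fills key and val in place and returns false at end of file
SequenceFile.Reader reader = new SequenceFile.Reader(fileSystem, new Path("/sf"), configuration);
LongWritable key = new LongWritable();
Text val = new Text();
while (reader.next(key, val)) {
    System.out.println(key.get() + "\t" + val.toString());
}
IOUtils.closeStream(reader);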

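import java.io.File;
import java.net.URI;
import java.util.Arrays;
import java.util.Collection;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Merges a large number of small files into one SequenceFile as key-value pairs.
 * The key is the small file's name, the value is the small file's content.
 * @author Administrator
 */
public class SequenceFileTest2 {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://hadoop1:9000"), configuration);

        // Write: create /sf on HDFS. The key/value classes must match what is appended below.
        SequenceFile.Writer writer = SequenceFile.createWriter(fileSystem, configuration, new Path("/sf"), Text.class, BytesWritable.class);
        // List files ending in .log under /usr/local/logs, without recursing into subdirectories
        Collection<File> listFiles = FileUtils.listFiles(new File("/usr/local/logs"), new String[]{"log"}, false);
        for (File file : listFiles) {
            // File name as key, file content as value
            writer.append(new Text(file.getName()), new BytesWritable(FileUtils.readFileToByteArray(file)));
        }
        IOUtils.closeStream(writer);

        // Read: write each small file back out under its original name
        SequenceFile.Reader reader = new SequenceFile.Reader(fileSystem, new Path("/sf"), configuration);
        Text key = new Text();
        BytesWritable value = new BytesWritable();
        while (reader.next(key, value)) {
            String fileName = "/usr/local/new_logs/" + key.toString();
            File file = new File(fileName);
            // getBytes() returns the backing buffer, which may be longer than the data; copy only getLength() bytes
            FileUtils.writeByteArrayToFile(file, Arrays.copyOf(value.getBytes(), value.getLength()));
        }
        IOUtils.closeStream(reader);
    }
}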

MapFile

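A MapFile is a sorted SequenceFile. Its directory structure shows that a MapFile consists of two parts: data and index.

The index file is the data index. It mainly records the key of a Record together with that Record's offset in the data file. When a MapFile is accessed, the index file is loaded into memory, and the index mapping makes it possible to locate the position of a given Record quickly. Compared with SequenceFile, lookups in a MapFile are therefore efficient. The drawback is that some memory is consumed to hold the index data.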

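Note that a MapFile does not record every Record in the index. By default, one index entry is stored for every 128 records. The interval can be changed, either with the setIndexInterval() method of MapFile.Writer or by setting the io.map.index.interval property.

Also, unlike SequenceFile, the key class of a MapFile must implement the WritableComparable interface, that is, keys must be comparable.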
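The sparse-index lookup described above can be sketched in plain Java. This is a simplified illustration, not Hadoop's implementation: the class `SparseIndexSketch`, its fields and the `lastScanned` counter are hypothetical names, and in-memory lists stand in for the data and index files. Every `interval`-th key is indexed; a lookup binary-searches the index and then scans forward through the sorted records.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: sorted records plus a sparse in-memory index, in the spirit of MapFile. */
class SparseIndexSketch {
    private final List<String> keys = new ArrayList<>();            // stands in for the "data" file
    private final List<String> values = new ArrayList<>();
    private final List<String> indexKeys = new ArrayList<>();       // stands in for the "index" file
    private final List<Integer> indexPositions = new ArrayList<>(); // position of each indexed key
    private final int interval;
    int lastScanned; // number of data records examined by the most recent get()

    SparseIndexSketch(int interval) {
        this.interval = interval;
    }

    /** Appends a record; keys must arrive in non-decreasing order, which is why they must be comparable. */
    void append(String key, String value) {
        if (!keys.isEmpty() && key.compareTo(keys.get(keys.size() - 1)) < 0) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (keys.size() % interval == 0) { // index only every interval-th record
            indexKeys.add(key);
            indexPositions.add(keys.size());
        }
        keys.add(key);
        values.add(value);
    }

    int indexSize() {
        return indexKeys.size();
    }

    /** Returns the value for key, or null if the key is absent. */
    String get(String key) {
        lastScanned = 0;
        // Binary search for the greatest indexed key that is <= the requested key.
        int lo = 0, hi = indexKeys.size() - 1, slot = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid).compareTo(key) <= 0) {
                slot = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (slot < 0) {
            return null; // key is smaller than every stored key
        }
        // Sequential scan from the indexed position until the key is found or passed.
        for (int i = indexPositions.get(slot); i < keys.size(); i++) {
            lastScanned++;
            int cmp = keys.get(i).compareTo(key);
            if (cmp == 0) return values.get(i);
            if (cmp > 0) return null;
        }
        return null;
    }
}
```

The trade-off is visible in the constructor argument: a smaller interval means more index entries held in memory but a shorter sequential scan per lookup, and a larger interval means the opposite.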

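// Imports needed by this snippet
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

final Configuration conf = new Configuration();
final FileSystem fs = FileSystem.get(new URI("hdfs://hadoop0:9000/"), conf);
// Write: /aaa becomes a directory containing the data and index files; keys must be appended in sorted order
final MapFile.Writer writer = new MapFile.Writer(conf, fs, "/aaa", Text.class, Text.class);
writer.append(new Text("1"), new Text("aa"));
IOUtils.closeStream(writer);
// Read: iterate over all records in key order
final MapFile.Reader reader = new MapFile.Reader(fs, "/aaa", conf);
final Text key = new Text();
final Text val = new Text();
while (reader.next(key, val)) {
    System.out.println(key.toString() + "\t" + val.toString());
}
IOUtils.closeStream(reader);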
