Summary:
1) Common compression formats
DEFLATE, bzip2, gzip, snappy
2) Codecs that need a native library
snappy: requires libsnappy.so and libhadoop.so
gzip / deflate: require zlib and libhadoop.so
3) Working with CompressionCodec
Compression: createOutputStream returns a CompressionOutputStream (a sketch follows this list)
Decompression: createInputStream returns a CompressionInputStream
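The notes only show the decompression direction in code, so here is a minimal compression sketch. It assumes GzipCodec as the codec and reuses the /home/README.txt path from the example below; instantiating the codec via ReflectionUtils is one common pattern, not the only option:

import java.io.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Instantiate the codec reflectively, the way Hadoop itself does
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    // Wrap the raw output stream so writes are compressed transparently
    CompressionOutputStream out =
            codec.createOutputStream(new FileOutputStream("/home/README.txt.gz"));
    IOUtils.copyBytes(new FileInputStream("/home/README.txt"), out, 4096);
    out.finish();  // flushes any buffered compressed data
    out.close();
}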
Decompression example:
import java.io.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.*;

public static void main(String[] args) throws IOException {
    String file = "/home/README.txt.gz";
    Configuration conf = new Configuration();
    // Infer the codec from the file extension (.gz -> GzipCodec)
    CompressionCodecFactory codecFactory = new CompressionCodecFactory(conf);
    CompressionCodec codec = codecFactory.getCodec(new Path(file));
    // Wrap the raw stream so reads are decompressed transparently
    CompressionInputStream in = codec.createInputStream(new FileInputStream(file));
    // Drop the codec's suffix (.gz) to name the output file
    FileOutputStream out = new FileOutputStream(
            CompressionCodecFactory.removeSuffix(file, codec.getDefaultExtension()));
    IOUtils.copyBytes(in, out, 4096);
    in.close();
    out.close();
}
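Given /home/README.txt.gz, this writes the decompressed bytes to /home/README.txt, i.e. the same effect as gunzip.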
HDFS Data Integrity
1) HDFS transparently checksums all data written to it: a separate checksum is created for every io.bytes.per.checksum bytes of data (a configuration property, 512 bytes by default). If a node detects a data error while verifying, it throws a ChecksumException; the client can also opt out of verification, as in the sketch after this list.
2) Besides verifying data as it is read, each datanode also runs a background process, the DataBlockScanner, which periodically verifies all the blocks stored on that datanode.
3) Once a corrupt block is detected, the DN receives a Block Command from the NN during the heartbeat and copies a fresh replica from another datanode.
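Regarding point 1, client-side verification can be switched off when you deliberately want to read a corrupt file, e.g. to salvage what is left of it. A minimal sketch, assuming the default FileSystem and a hypothetical path /user/demo/part-0000; setVerifyChecksum is the standard FileSystem call:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Turn off client-side verification so a corrupt block can still be
    // read without triggering a ChecksumException
    fs.setVerifyChecksum(false);
    IOUtils.copyBytes(fs.open(new Path("/user/demo/part-0000")), System.out, 4096);
}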
Using ChecksumFileSystem
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // LocalFileSystem extends ChecksumFileSystem, so each file it writes
    // has a hidden companion checksum file named .<name>.crc
    LocalFileSystem localFs = FileSystem.getLocal(conf);
    // Prints the checksum file's path, e.g. /home/.README.txt.crc
    System.out.println(localFs.getChecksumFile(new Path("/home/README.txt")));
}
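To skip checksums on the local filesystem entirely, the underlying RawLocalFileSystem can be used in place of LocalFileSystem.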