1. Data integrity: I/O operations in any language must preserve the integrity of the data they handle. Hadoop naturally wants data to survive storage and processing without loss or corruption, and the usual way to check data integrity is a checksum.
- Data integrity in HDFS: clients verify checksums whenever they write or read HDFS files. Verification can be disabled by passing false to setVerifyChecksum() on the FileSystem before calling open() to read (see the sketch after this list).
- The local file system: Hadoop's local file system performs client-side checksumming. When a file named filename is written, the file system client transparently creates a hidden companion file, .filename.crc, whose metadata records the chunk size used for checksumming, so the checksums can be verified when the file is read back.
- ChecksumFileSystem: a wrapper class that can be used to add checksum verification to other file systems.
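As a minimal sketch of the first point, the program below (the ChecksumDemo class name is made up for illustration) disables checksum verification before reading a file and copies it to standard output:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ChecksumDemo {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. an hdfs:// path passed on the command line
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        fs.setVerifyChecksum(false);                    // must be called before open()
        FSDataInputStream in = fs.open(new Path(uri));
        IOUtils.copyBytes(in, System.out, 4096, true);  // true: close the stream when done
    }
}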
2. Compression: compression saves storage space and reduces the amount of data transferred over the network, so it matters a great deal in Hadoop. Hadoop's compression formats:
| Format | Algorithm | File extension | Multiple files | Splittable |
|---|---|---|---|---|
| DEFLATE | DEFLATE | .deflate | No | No |
| gzip (zip) | DEFLATE | .gz (.zip) | No (yes) | No (yes) |
| bzip2 | bzip2 | .bz2 | No | Yes |
| LZO | LZO | .lzo | No | No |
- Codecs (compression/decompression): a codec is an implementation of a compression-decompression algorithm, represented in Hadoop by the CompressionCodec interface (for example, org.apache.hadoop.io.compress.GzipCodec for gzip). The program below compresses data read from standard input and writes it to standard output, using the codec class named on the command line:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor
{
    public static void main(String[] args) throws Exception
    {
        // Load and instantiate the codec class named on the command line,
        // e.g. org.apache.hadoop.io.compress.GzipCodec
        String codecClassname = args[0];
        Class<?> codecClass = Class.forName(codecClassname);
        Configuration configuration = new Configuration();
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, configuration);

        // Wrap System.out in a compressing stream and pipe stdin through it
        CompressionOutputStream outputStream = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, outputStream, 4096, false);
        outputStream.finish(); // flush compressed data without closing System.out
    }
}
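Assuming the compiled class is on the Hadoop classpath, a hypothetical invocation gzips a string and pipes it straight back through gunzip:

% echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -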
- Compression and splitting: HDFS stores files in blocks, so when choosing a compression format it is important to consider whether the format supports splitting; a non-splittable format forces a single map task to read the whole file.
- Using compression in MapReduce: for example, to compress the output of a MapReduce job, set the mapred.output.compress property to true in the job configuration:
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperatureWithCompression {
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperatureWithCompression <input path> <output path>");
            System.exit(-1);
        }

        JobConf conf = new JobConf(MaxTemperatureWithCompression.class);
        conf.setJobName("Max temperature with output compression");
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Compress the job output with gzip
        conf.setBoolean("mapred.output.compress", true);
        conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

        // Mapper/reducer from the standard max-temperature example
        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setCombinerClass(MaxTemperatureReducer.class);
        conf.setReducerClass(MaxTemperatureReducer.class);
        JobClient.runJob(conf);
    }
}
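Map output can be compressed too, which reduces the data shuffled between the map and reduce phases. A minimal sketch using the same old-style API, assuming conf is the JobConf from the listing above:

conf.setCompressMapOutput(true);                    // compress intermediate map output
conf.setMapOutputCompressorClass(GzipCodec.class);  // codec used for the map output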
3. Serialization: the conversion between structured objects and byte streams. Hadoop uses serialization for interprocess communication (RPC calls), and an RPC serialization format should be compact, fast, extensible, and interoperable. Hadoop uses its own serialization format, Writable.
- The Writable interface:
package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable {
    void write(DataOutput out) throws IOException;     // serialize this object's fields to DataOutput
    void readFields(DataInput in) throws IOException;  // deserialize this object's fields from DataInput
}
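The interface is small enough that a custom type is easy to write. Below is a minimal sketch (the IntPairWritable class is invented here for illustration, not part of the original text) that serializes a pair of ints:

package WritablePackage;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class IntPairWritable implements Writable {
    private int first;
    private int second;

    public IntPairWritable() { }  // Writables need a no-arg constructor for reflection

    public IntPairWritable(int first, int second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);      // write fields in a fixed order
        out.writeInt(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();     // read fields back in the same order
        second = in.readInt();
    }
}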
The helper class below serializes Writable objects to byte arrays and back, and renders the bytes as a hex string for inspection:

package WritablePackage;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

public class WritableTestBase
{
    // Serialize a Writable into a byte array
    public static byte[] serialize(Writable writable) throws IOException
    {
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        DataOutputStream dataOutputStream = new DataOutputStream(outputStream);
        writable.write(dataOutputStream);
        dataOutputStream.close();
        return outputStream.toByteArray();
    }

    // Populate a Writable from a byte array and return the bytes
    public static byte[] deserialize(Writable writable, byte[] bytes) throws IOException
    {
        ByteArrayInputStream inputStream = new ByteArrayInputStream(bytes);
        DataInputStream dataInputStream = new DataInputStream(inputStream);
        writable.readFields(dataInputStream);
        dataInputStream.close();
        return bytes;
    }

    // Serialize a Writable and return its bytes as a hex string
    public static String serializeToString(Writable src) throws IOException
    {
        return StringUtils.byteToHexString(serialize(src));
    }

    // Copy src into des via serialization and return the bytes as hex
    public static String writeTo(Writable src, Writable des) throws IOException
    {
        byte[] data = deserialize(des, serialize(src));
        return StringUtils.byteToHexString(data);
    }
}
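As a quick check of these helpers, the following sketch (the RoundTripDemo class name is made up) round-trips an IntWritable through serialize() and deserialize():

package WritablePackage;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class RoundTripDemo {
    public static void main(String[] args) throws IOException {
        IntWritable src = new IntWritable(163);
        byte[] bytes = WritableTestBase.serialize(src);
        System.out.println(bytes.length);                            // 4 (see the table below)
        System.out.println(WritableTestBase.serializeToString(src)); // 000000a3

        IntWritable dest = new IntWritable();
        WritableTestBase.deserialize(dest, bytes);
        System.out.println(dest.get());                              // 163
    }
}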
Writable classes:
| Java primitive | Writable implementation | Serialized size (bytes) |
|---|---|---|
| boolean | BooleanWritable | 1 |
| byte | ByteWritable | 1 |
| int | IntWritable | 4 |
| int | VIntWritable | 1–5 |
| float | FloatWritable | 4 |
| long | LongWritable | 8 |
| long | VLongWritable | 1–9 |
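The variable-length types in the table deserve a quick demonstration. Reusing serializeToString() from the helper class above (the VIntDemo class name is made up), the same small value costs fewer bytes as a VIntWritable than as a fixed-width IntWritable:

package WritablePackage;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.VIntWritable;

public class VIntDemo {
    public static void main(String[] args) throws IOException {
        // Fixed width: always 4 bytes
        System.out.println(WritableTestBase.serializeToString(new IntWritable(163)));
        // Variable width: 1-5 bytes depending on the value
        System.out.println(WritableTestBase.serializeToString(new VIntWritable(163)));
    }
}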
4. File-based data structures
- The SequenceFile class: provides a persistent data structure for binary key/value pairs. The two programs below write and then read back a SequenceFile.
package WritablePackage;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo
{
    private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    public static void main(String[] args) throws IOException
    {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try
        {
            writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
            for (int i = 0; i < 100; i++)
            {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                // getLength() gives the position at which this record starts
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        }
        finally
        {
            IOUtils.closeStream(writer);
        }
    }
}
package WritablePackage;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo
{
    public static void main(String[] args) throws IOException
    {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        SequenceFile.Reader reader = null;
        try
        {
            reader = new SequenceFile.Reader(fs, path, conf);
            // The key and value types are recorded in the file's header
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            long position = reader.getPosition();
            while (reader.next(key, value))
            {
                String syncSeen = reader.syncSeen() ? "*" : "";
                System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
                position = reader.getPosition(); // beginning of next record
            }
        }
        finally
        {
            IOUtils.closeStream(reader);
        }
    }
}
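The * in the reader's output marks records preceded by a sync point. Sync markers let a reader that seeks to an arbitrary byte offset resynchronize with the next record boundary, which is what makes a SequenceFile usable as splittable MapReduce input.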
- MapFile: a sorted SequenceFile with an index that permits lookups by key. A MapFile can be thought of as a persistent form of java.util.Map; note that entries must be appended in key order. A short sketch follows.
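A minimal sketch of writing and then looking up a MapFile (the MapFileDemo class and the numbers.map path are made up for illustration):

package WritablePackage;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String uri = "numbers.map"; // hypothetical output directory

        // Write entries; MapFile.Writer rejects keys appended out of order
        IntWritable key = new IntWritable();
        Text value = new Text();
        MapFile.Writer writer = new MapFile.Writer(conf, fs, uri, IntWritable.class, Text.class);
        try {
            for (int i = 1; i <= 10; i++) {
                key.set(i);
                value.set("entry-" + i);
                writer.append(key, value);
            }
        } finally {
            writer.close();
        }

        // Look up a key through the MapFile's index
        MapFile.Reader reader = new MapFile.Reader(fs, uri, conf);
        try {
            Text result = new Text();
            reader.get(new IntWritable(7), result);
            System.out.println(result); // entry-7
        } finally {
            reader.close();
        }
    }
}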