An Overview of SerDe in Hive 0.5

I. Background

1. When processes communicate remotely, they can exchange data of many types, but whatever the type, the data travels over the network as a binary sequence. The sender must convert an object into a byte sequence before it can be transmitted; this is called object serialization. The receiver must then restore the byte sequence back into an object; this is called deserialization.
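
The round trip looks like the following in plain Java (a minimal sketch using java.io serialization; the Record class and the values are purely illustrative and unrelated to Hive):

import java.io.*;

public class SerializationDemo {
   // A purely illustrative serializable type (not part of Hive).
   static class Record implements Serializable {
      long time;
      String host;
      Record(long time, String host) { this.time = time; this.host = host; }
   }

   public static void main(String[] args) throws Exception {
      // Serialization: object -> byte sequence.
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      ObjectOutputStream out = new ObjectOutputStream(buf);
      out.writeObject(new Record(1234567891012L, "wiki.apache.org"));
      out.close();
      byte[] bytes = buf.toByteArray();

      // Deserialization: byte sequence -> object.
      ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
      Record restored = (Record) in.readObject();
      System.out.println(restored.time + " " + restored.host);
   }
}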

2. In Hive, deserialization turns the key/value pair read from storage into the values of each column of a Hive table.

3. Hive can load data into a table without transforming it first, which saves a great deal of time when processing massive datasets.

II. Technical Details

1. SerDe is short for Serializer/Deserializer; its purpose is serialization and deserialization.

2. When creating a table, users can supply a custom SerDe or use one of Hive's built-in SerDes. The SerDe defines the table's columns and maps the underlying data onto those columns.

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type
    [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
  [SORTED BY (col_name [ASC|DESC], ...)]
  INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]
To create a table with a specified SerDe, use the ROW FORMAT row_format clause, for example:

a. Add the jar. In the Hive CLI: hive> add jar /run/serde_test.jar;
or run from the Linux shell: ${HIVE_HOME}/bin/hive -auxpath /run/serde_test.jar
b. Create the table: create table serde_table row format serde 'hive.connect.TestDeserializer';
3. Write the deserialization class TestDeserializer, implementing the three methods of the Deserializer interface:

a) Initialization: initialize(Configuration conf, Properties tbl).

b) Deserialize a Writable into a Java Object: deserialize(Writable blob).

c) Get the ObjectInspector for the Object returned by deserialize(Writable blob): getObjectInspector().

public interface Deserializer {

  /**
   * Initialize the HiveDeserializer.
   * @param conf System properties
   * @param tbl  table properties
   * @throws SerDeException
   */
  public void initialize(Configuration conf, Properties tbl) throws SerDeException;
 
  /**
   * Deserialize an object out of a Writable blob.
   * In most cases, the return value of this function will be constant since the function
   * will reuse the returned object.
   * If the client wants to keep a copy of the object, the client needs to clone the
   * returned value by calling ObjectInspectorUtils.getStandardObject().
   * @param blob The Writable object containing a serialized object
   * @return A Java object representing the contents in the blob.
   */
  public Object deserialize(Writable blob) throws SerDeException;

  /**
   * Get the object inspector that can be used to navigate through  the internal
   * structure of the Object returned from deserialize(...).
   */
  public ObjectInspector getObjectInspector() throws SerDeException;

}
The following class deserializes one line of data into the four Hive table fields time, userid, host, and path. For example:

package hive.connect;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TestDeserializer implements Deserializer {
   private static List<String> FieldNames = new ArrayList<String>();
   private static List<ObjectInspector> FieldNamesObjectInspectors = new ArrayList<ObjectInspector>();
   static {
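     // Declare the four output columns and the ObjectInspector describing each one.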
     FieldNames.add("time");
     FieldNamesObjectInspectors.add(ObjectInspectorFactory
          .getReflectionObjectInspector(Long.class,
               ObjectInspectorOptions.JAVA));
     FieldNames.add("userid");
     FieldNamesObjectInspectors.add(ObjectInspectorFactory
          .getReflectionObjectInspector(Integer.class,
               ObjectInspectorOptions.JAVA));
     FieldNames.add("host");
     FieldNamesObjectInspectors.add(ObjectInspectorFactory
          .getReflectionObjectInspector(String.class,
               ObjectInspectorOptions.JAVA));

     FieldNames.add("path");
     FieldNamesObjectInspectors.add(ObjectInspectorFactory
          .getReflectionObjectInspector(String.class,
               ObjectInspectorOptions.JAVA));

   }

   @Override
   public Object deserialize(Writable blob) {
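     // Expected input: one text line with tab-separated time, userid, and url.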
     try {
        if (blob instanceof Text) {
          String line = ((Text) blob).toString();
          if (line == null)
             return null;
          String[] field = line.split("/t");
          if (field.length != 3) {
             return null;
          }
          List<Object> result = new ArrayList<Object>();
          URL url = new URL(field[2]);
          Long time = Long.valueOf(field[0]);
          Integer userid = Integer.valueOf(field[1]);
          result.add(time);
          result.add(userid);
          result.add(url.getHost());
          result.add(url.getPath());
          return result;
        }
     } catch (MalformedURLException e) {
        e.printStackTrace();
     }
     return null;
   }

   @Override
   public ObjectInspector getObjectInspector() throws SerDeException {
     return ObjectInspectorFactory.getStandardStructObjectInspector(
          FieldNames, FieldNamesObjectInspectors);
   }

   @Override
   public void initialize(Configuration arg0, Properties arg1)
        throws SerDeException {
   }

}
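
Before registering the class with Hive, deserialize can be sanity-checked locally. Below is a minimal sketch (the driver class and sample line are illustrative only; it assumes the Hadoop and Hive serde2 jars are on the classpath):

package hive.connect;

import java.util.List;

import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.io.Text;

public class TestDeserializerDriver {
   public static void main(String[] args) throws Exception {
      TestDeserializer serde = new TestDeserializer();
      serde.initialize(null, null); // this implementation ignores both arguments

      // One tab-separated input line: time, userid, url.
      Text line = new Text(
           "1234567891012\t123456\thttp://wiki.apache.org/hadoop/Hive/LanguageManual/UDF");
      Object row = serde.deserialize(line);

      // Navigate the deserialized row through its ObjectInspector.
      StructObjectInspector oi = (StructObjectInspector) serde.getObjectInspector();
      List<Object> fields = oi.getStructFieldsDataAsList(row);
      System.out.println(fields);
      // Expected: [1234567891012, 123456, wiki.apache.org, /hadoop/Hive/LanguageManual/UDF]
   }
}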
Test the table against data on HDFS. One test record looks like this:

1234567891012 123456 http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF

hive> add jar /run/jar/merg_hua.jar;                                           
Added /run/jar/merg_hua.jar to class path
hive> create table serde_table row format serde 'hive.connect.TestDeserializer';
Found class for hive.connect.TestDeserializer
OK
Time taken: 0.028 seconds
hive> describe serde_table;
OK
time    bigint  from deserializer
userid  int     from deserializer
host    string  from deserializer
path    string  from deserializer
Time taken: 0.042 seconds
hive> select * from serde_table;
OK
1234567891012   123456  wiki.apache.org /hadoop/Hive/LanguageManual/UDF
Time taken: 0.039 seconds
III. Summary
1. When creating a Hive table with a custom SerDe, you need to write a class implementing Deserializer and select it with the ROW FORMAT clause of the CREATE command.

2. When processing massive datasets, if the data format already matches the table structure, Hive's deserialization can be applied directly without transforming the data, which saves a great deal of time.

 
