An Overview of SerDe in Hive

I. Background

1. When processes communicate remotely, they may exchange data of many types, but whatever the type, the data travels over the network as a binary sequence. The sender must convert an object into a byte sequence before it can be transmitted; this is object serialization. The receiver must restore the byte sequence back into an object; this is object deserialization.

2. Hive's deserialization turns each key/value pair into the column values of a Hive table row.

3. Hive can load data into a table directly, without first transforming the data, which saves a great deal of time when processing massive data sets.

II. Technical Details

1. SerDe is short for Serialize/Deserialize; it handles serialization and deserialization.

2. When creating a table, the user can specify either a custom SerDe or one of Hive's built-in SerDes. The SerDe defines the table's columns and maps the corresponding data onto them. The table-creation syntax is:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)]
    INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]

To create a table with a specified SerDe, use the ROW FORMAT row_format parameter. For example:

a) Add the jar. In the Hive CLI, enter: hive> add jar /run/serde_test.jar;

Or run the following command from the Linux shell: ${HIVE_HOME}/bin/hive -auxpath /run/serde_test.jar

b) Create the table: create table serde_table row format serde 'hive.connect.TestDeserializer';
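For comparison, when one of Hive's built-in SerDes is sufficient, the same ROW FORMAT clause describes the format directly instead of naming a custom class. A minimal sketch using the built-in delimited text format (the table and column names here are only illustrative):

create table delimited_table (time bigint, userid int, host string, path string)
row format delimited fields terminated by '\t'
stored as textfile;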

3. Write the deserialization class TestDeserializer, implementing the three methods of the Deserializer interface:

a) Initialization: initialize(Configuration conf, Properties tbl).

b) Deserialize a Writable and return an Object: deserialize(Writable blob).

c) Get the inspector for the Object returned by deserialize(Writable blob): getObjectInspector().

public interface Deserializer {

  /**
   * Initialize the HiveDeserializer.
   * @param conf System properties
   * @param tbl  table properties
   * @throws SerDeException
   */
  public void initialize(Configuration conf, Properties tbl) throws SerDeException;

  /**
   * Deserialize an object out of a Writable blob.
   * In most cases, the return value of this function will be constant since the function
   * will reuse the returned object.
   * If the client wants to keep a copy of the object, the client needs to clone the
   * returned value by calling ObjectInspectorUtils.getStandardObject().
   * @param blob The Writable object containing a serialized object
   * @return A Java object representing the contents in the blob.
   */
  public Object deserialize(Writable blob) throws SerDeException;

  /**
   * Get the object inspector that can be used to navigate through the internal
   * structure of the Object returned from deserialize(...).
   */
  public ObjectInspector getObjectInspector() throws SerDeException;
}

The following deserialization class splits one line of data into the four fields of the Hive table: time, userid, host, and path. For example:

package hive.connect;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TestDeserializer implements Deserializer {
  private static List<String> FieldNames = new ArrayList<String>();
  private static List<ObjectInspector> FieldNamesObjectInspectors = new ArrayList<ObjectInspector>();

  static {
    FieldNames.add("time");
    FieldNamesObjectInspectors.add(ObjectInspectorFactory
        .getReflectionObjectInspector(Long.class, ObjectInspectorOptions.JAVA));
    FieldNames.add("userid");
    FieldNamesObjectInspectors.add(ObjectInspectorFactory
        .getReflectionObjectInspector(Integer.class, ObjectInspectorOptions.JAVA));
    FieldNames.add("host");
    FieldNamesObjectInspectors.add(ObjectInspectorFactory
        .getReflectionObjectInspector(String.class, ObjectInspectorOptions.JAVA));
    FieldNames.add("path");
    FieldNamesObjectInspectors.add(ObjectInspectorFactory
        .getReflectionObjectInspector(String.class, ObjectInspectorOptions.JAVA));
  }

  @Override
  public Object deserialize(Writable blob) {
    try {
      if (blob instanceof Text) {
        String line = ((Text) blob).toString();
        if (line == null)
          return null;
        String[] field = line.split("\t");
        if (field.length != 3) {
          return null;
        }
        List<Object> result = new ArrayList<Object>();
        URL url = new URL(field[2]);
        Long time = Long.valueOf(field[0]);
        Integer userid = Integer.valueOf(field[1]);
        result.add(time);
        result.add(userid);
        result.add(url.getHost());
        result.add(url.getPath());
        return result;
      }
    } catch (MalformedURLException e) {
      e.printStackTrace();
    }
    return null;
  }

  @Override
  public ObjectInspector getObjectInspector() throws SerDeException {
    return ObjectInspectorFactory.getStandardStructObjectInspector(
        FieldNames, FieldNamesObjectInspectors);
  }

  @Override
  public void initialize(Configuration arg0, Properties arg1) throws SerDeException {
  }
}
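To sanity-check the class outside of Hive, one could hand it a record directly. The following is a minimal sketch, not part of the original article; it assumes TestDeserializer and the Hadoop/Hive jars are on the classpath, and uses the same test line shown below:

package hive.connect;

import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;

// Hypothetical standalone check for TestDeserializer (illustrative only).
public class TestDeserializerDemo {
  public static void main(String[] args) throws Exception {
    TestDeserializer serde = new TestDeserializer();
    serde.initialize(new Configuration(), new Properties());
    // One tab-separated record: time, userid, url
    Text line = new Text("1234567891012\t123456\thttp://wiki.apache.org/hadoop/Hive/LanguageManual/UDF");
    // Expected output: [1234567891012, 123456, wiki.apache.org, /hadoop/Hive/LanguageManual/UDF]
    System.out.println(serde.deserialize(line));
  }
}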

Test the Hive table against data on HDFS. The following is one test record:

1234567891012 123456 http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF

hive> add jar /run/jar/merg_hua.jar;
Added /run/jar/merg_hua.jar to class path
hive> create table serde_table row format serde 'hive.connect.TestDeserializer';
Found class for hive.connect.TestDeserializer
OK
Time taken: 0.028 seconds
hive> describe serde_table;
OK
time    bigint  from deserializer
userid  int     from deserializer
host    string  from deserializer
path    string  from deserializer
Time taken: 0.042 seconds
hive> select * from serde_table;
OK
1234567891012   123456  wiki.apache.org /hadoop/Hive/LanguageManual/UDF
Time taken: 0.039 seconds
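Note that the final query assumes the test record has already been placed in the table's storage. One way to do that, assuming the record is saved in a local tab-separated text file (the path here is hypothetical), is:

hive> load data local inpath '/run/data/serde_test.txt' into table serde_table;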

III. Summary

1. To use a custom (de)serializer with a Hive table, write your own class implementing Deserializer and select it through the ROW FORMAT parameter of the CREATE command.

2. When processing massive data sets, if the data format already matches the table schema, Hive's deserialization can be used directly without transforming the data, saving a great deal of time.


