The escape character \ (pitfalls in Hive + shell and Java): in regular expressions the escape character is written as a double backslash, and split's delimiter is also parsed as a regex

The escape character

An escape character escapes the character that follows it, so that a character with special meaning is treated as an ordinary one, or an ordinary character is given a special meaning.
It appears in just about every language: Java, Python, SQL, Hive, shell, and so on.
For example, in SQL the sequences
        "\""    
        "\'"
        "\t"
        "\n"
are output directly as
        "
        '
        a tab
        a newline

Typical uses of the escape character

"\"轉義字符放到字符前面,如java和python輸出內容用雙引號標識,雙引號中可以用轉義字符\進行轉義輸出,比如輸出雙引號
java中 system.out.print("\"")
python中 print "\""
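A minimal, self-contained Java sketch of the same idea (the class name and sample strings are made up for illustration):

    public class EscapeBasics {
        public static void main(String[] args) {
            System.out.println("He said \"hi\"");   // \"  -> a literal double quote
            System.out.println("col1\tcol2");       // \t  -> a tab
            System.out.println("line1\nline2");     // \n  -> a newline
            System.out.println("C:\\temp");         // \\  -> a single backslash
        }
    }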

A special case: escaping the escape character itself

The special case is the escape character escaping itself: in Java, for instance, you sometimes need two escape characters "\\", or even four "\\\\".

1) The two cases in Java
    regular-expression matching and String's split function
    In both cases, when the pattern string contains the escape character "\", the escape character itself has to be escaped first, i.e. you need two escape characters "\\". (Java parses the string literal first, and the regex / split machinery then parses the result again.)
    To match a literal backslash "\" you even need four escape characters "\\\\": the Java compiler first reduces the literal to the two characters "\\", and the regex / split parser then reduces those to a single "\", which is the character actually matched.
    In other words, two rounds of parsing are needed before a single backslash "\" appears in the pattern, and only then can it escape the character that follows it. Both cases are sketched below.
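A minimal runnable sketch of both cases (the class name and sample strings are made up for illustration):

    public class JavaEscapeDemo {
        public static void main(String[] args) {
            // Case 1: split on a literal '|'. The source literal "\\|" becomes the
            // 2-character pattern \| , which the regex engine reads as a literal '|'.
            String line = "a|b|c";
            System.out.println(line.split("\\|")[1]);        // prints: b

            // Case 2: split on a literal backslash. The source literal "\\\\" becomes
            // the 2-character pattern \\ , i.e. one escaped backslash for the regex engine.
            String path = "a\\b\\c";                         // the actual string is a\b\c
            System.out.println(path.split("\\\\")[2]);       // prints: c

            // The same doubling applies to java.util.regex directly:
            System.out.println("1+2".matches("\\d\\+\\d"));  // prints: true
        }
    }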

2) split and regular expressions in Hive
    Hive is written in Java, so the same two cases also need the two characters "\\". Take a query using split as an example:
    select 
    ad,
    '月資費類型' as feature,
    (CASE subscriptionfee_id
        when '0' then '無'
        when '1' then '[0,50)'
        when '2' then '[50,100]'
        when '3' then '[100,150]'
        when '4' then '[150,200)'
        when '5' then '>=200'
        else 'error_data' 
    END) as feature_detail,
    1 as type
    from mengniubi.dianxin_user_tags
    union all
    select 
    ad,
    '愛好分佈' as feature,
    split(new_interest,'\\|')[1] as feature_detail,
    2 as type
    from mengniubi.dianxin_user_tags
    lateral view explode(interests) AllInterests as new_interest
    union all
    select 
    ad,
    '商品瀏覽' as feature,
    split(products,'\\|')[0] as feature_detail,
    4 as type
    from mengniubi.dianxin_user_tags
    lateral view explode(split(product_view_cates,',')) AllProducts as products
In this code, if the delimiter were the backslash "\" itself, four escape characters "\\\\" would be needed, i.e.
    split(products,'\\\\')[0] as feature_detail,
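Hive's split ultimately uses Java's regex engine, and the SQL string literal goes through Hive's own unescaping first (the two-level parsing described above). A rough Java sketch of what finally reaches the regex engine in the '\\\\' case (the class name and sample data are made up):

    public class HiveBackslashDelimiter {
        public static void main(String[] args) {
            // The Hive literal '\\\\' is unescaped by Hive's SQL parser to the two
            // characters \\ , which the regex engine then reads as one literal backslash.
            String regexSeenByEngine = "\\\\";            // Java literal for the 2-char pattern \\
            String products = "shoes\\hats\\bags";        // hypothetical field value: shoes\hats\bags
            System.out.println(products.split(regexSeenByEngine)[0]);   // prints: shoes
        }
    }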

3) Running Hive statements from a shell script
The shell has escape characters of its own and processes them itself.
So when a Hive statement is executed inside a shell script, it is first unescaped by the shell and only then parsed by Hive, which adds yet another level of escaping.
If the Hive statement above were pasted into a shell script unchanged, it would fail: the shell consumes one level of backslashes and passes "\|" on to Hive; after Hive parses that escape character in the string literal, split no longer receives a usable delimiter.
So remember that when a Hive statement runs from a shell script, the escape characters have to be doubled again. Hive processes the statement after the shell has unescaped it, and it only runs if the statement is still correct at that point. The code above looks like this in a shell script:

#!/bin/bash
##### execute hive sql for analyzing data #####
arg_count=$#
if [ $arg_count -lt 1 ];then
   echo "參數錯誤 [$*], Usage:$0 2015-08"
   exit 1
fi

if [ ! -d "$HIVE_HOME" ];then
   echo "HIVE_HOME not exists .. "
   exit 2
fi

month_arg=$1
echo "month : ${month_arg}"

echo "start ... "
########################  SQL EDIT AREA 1 BEGIN #####################################
msg="step1 t_bi_daily_ad_area_report .."
echo
echo
echo $msg
echo
echo
sql=$(cat <<!EOF

USE test_bi;
set mapred.queue.names=queue3;
SET mapred.reduce.tasks=14;

insert OVERWRITE table t_bi_figure_whole_network_report partition(month='${month_arg}')
select '-1' as brand,feature,feature_detail,type,count(ad) as count
from 
(
    select 
    ad,
    '月資費類型' as feature,
    (CASE subscriptionfee_id
        when '0' then '無'
        when '1' then '[0,50)'
        when '2' then '[50,100]'
        when '3' then '[100,150]'
        when '4' then '[150,200)'
        when '5' then '>=200'
        else 'error_data' 
    END) as feature_detail,
    1 as type
    from mengniubi.dianxin_user_tags
    union all
    select 
    ad,
    '愛好分佈' as feature,
    split(new_interest,'\\\\|')[1] as feature_detail,
    2 as type
    from mengniubi.dianxin_user_tags
    lateral view explode(interests) AllInterests as new_interest
    union all
    select 
    ad,
    '商品瀏覽' as feature,
    split(products,'\\\\|')[0] as feature_detail,
    4 as type
    from mengniubi.dianxin_user_tags
    lateral view explode(split(product_view_cates,',')) AllProducts as products

) t1
group by feature,feature_detail,type
union all
select '-1' as brand,
    '搜索關鍵字' as feature,
    search_word as feature_detail,
    2 as type,
    count(1) as count 
from mengniubi.dianxin_user_tags
lateral view explode(split(search_keywords,',')) AllKeyWords as search_word
where search_word is not null and search_word <> '' 
group by search_word
order by count desc
limit 1000;
!EOF)
########### execute begin ##########
echo $sql
$HIVE_HOME/bin/hive -e "$sql"
exitCode=$?
if [ $exitCode -ne 0 ];then
   echo "[ERROR] $msg"
   exit $exitCode
fi
########### execute end  ###########
########################  SQL EDIT AREA 1 END #######################################

In Hive regular expressions the escape character is a double backslash

In Hive's split function the delimiter is a regular expression

split(string str, string pat) 

Splits str around pat (pat is a regular expression).
The delimiter here is a regular expression. Splitting on a single ordinary character works as expected, e.g.
select 
    split(all,'~')
from
tb_pmp_log_all_lmj_tmp
limit 10
When the delimiter is the pipe '|', using it directly makes it the regex alternation operator with nothing on either side, so it matches the empty string and splits the field into individual characters, e.g.
    select 
        split(all,'|')
    from
    tb_pmp_log_all_lmj_tmp
    limit 10

Because the escape character in a Hive regular expression is written as a double backslash, a literal '|' becomes '\\|', e.g.

    select 
        split(all,'\\|')
    from
    tb_pmp_log_all_lmj_tmp
    limit 10

A bare '|+', on the other hand, is not a valid pattern and throws an error.
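A quick Java check of all three patterns (Hive's split delegates to the same regex engine; the class name and sample string are made up):

    import java.util.Arrays;
    import java.util.regex.Pattern;

    public class PipeSplitCheck {
        public static void main(String[] args) {
            String s = "ab|cd";
            // A bare "|" is alternation between two empty branches, matches the empty
            // string everywhere, and splits between every character (Java 8+ output).
            System.out.println(Arrays.toString(s.split("|")));    // [a, b, |, c, d]
            // "\\|" escapes the pipe, so it is matched literally.
            System.out.println(Arrays.toString(s.split("\\|")));  // [ab, cd]
            // "|+" is rejected: the '+' has nothing to repeat.
            try {
                Pattern.compile("|+");
            } catch (java.util.regex.PatternSyntaxException e) {
                System.out.println(e.getMessage());               // e.g. "Dangling meta character '+'"
            }
        }
    }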

When the delimiter is a combination of several characters, remember that inside the regex the double backslash \\ is the escape; writing just '\|' would be wrong.
To match the delimiter "|~|", for example:

select 
    split(all,'\\|~\\|')
from
tb_pmp_log_all_lmj_tmp
limit 10

Alternatively, put the characters into a [] character class (note that '[|~]+' matches any run of '|' and '~' characters, not only the exact sequence '|~|'):

select 
    split(all,'[|~]+')
from
tb_pmp_log_all_lmj_tmp
limit 10
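
A small Java comparison of the two patterns (class name and sample strings made up for illustration):

    import java.util.Arrays;

    public class MultiCharDelimiterCheck {
        public static void main(String[] args) {
            // Both patterns split a well-formed record the same way ...
            System.out.println(Arrays.toString("a|~|b|~|c".split("\\|~\\|")));  // [a, b, c]
            System.out.println(Arrays.toString("a|~|b|~|c".split("[|~]+")));    // [a, b, c]
            // ... but the character class also splits on other runs of '|' and '~'.
            System.out.println(Arrays.toString("a~~b".split("\\|~\\|")));       // [a~~b]
            System.out.println(Arrays.toString("a~~b".split("[|~]+")));         // [a, b]
        }
    }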