Environment setup

Install Hadoop; see https://www.jianshu.com/p/9c8a0f7b98cf
Install Hive; see https://www.jianshu.com/p/ed4c2852754c

Note: the test environment in this article is a single machine, not a cluster.

CLI connection

Once installation is complete, you can connect directly from the client with the hive command and run operations against it.

HiveServer2/beeline connection

For a CLI connection, typing hive actually runs hive --service cli at startup.
For a beeline connection, the service must first be started with hive --service hiveserver2.

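After startup, the jps command shows a process named RunJar.
Started this way, however, the service stops as soon as the terminal is closed. It is better to start it as a background service: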

nohup hiveserver2 1>[path to standard output log] 2>[path to error log] &

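nohup keeps the service from being terminated by the hangup signal when the terminal closes, 1 redirects standard output, 2 redirects standard error, and the trailing & is required to run the service in the background.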
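
A service started in the background does not accept connections immediately, so a script may want to wait for it before connecting. The following is a minimal Python 3 sketch of such a check, not part of the original setup: wait_for_port is a hypothetical helper name, and the host and port to pass (localhost and 10000) are taken from the JDBC URL used in the Python example later in this article.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.time() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.time() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port("localhost", 10000) before invoking beeline only confirms that the port is open; it does not prove that HiveServer2 has finished initializing.
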
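Once the hiveserver2 service is running, you can connect to it with the beeline client.
beeline is in Hive's bin directory. The connection command, using the same address and user as the Python example later in this article, is:

beeline -u 'jdbc:hive2://localhost:10000' -n 'hadoop'

The -u option is the JDBC URL of the HiveServer2 instance and -n is the username (a password, if one is required, is passed with -p).
Once connected, you can run database operations.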
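
When the beeline command has to be assembled from code, building it as an argument list avoids quoting mistakes. The sketch below is Python 3 and illustrative only: build_beeline_cmd and to_shell are hypothetical helper names, the default beeline path and the URL in the usage note are the ones that appear in the Python example later in this article, and -u, -n, -p and -f are the standard beeline options for URL, username, password and script file.

```python
import shlex


def build_beeline_cmd(jdbc_url, user, password=None, hql_file=None,
                      beeline="/usr/local/hive/bin/beeline"):
    """Build the argument list for a beeline invocation."""
    argv = [beeline, "-u", jdbc_url, "-n", user]
    if password is not None:
        argv += ["-p", password]      # only added when a password is given
    if hql_file is not None:
        argv += ["-f", hql_file]      # run the statements in this script file
    return argv


def to_shell(argv):
    """Render an argument list as a single, safely quoted shell command."""
    return " ".join(shlex.quote(a) for a in argv)
```

For example, to_shell(build_beeline_cmd("jdbc:hive2://localhost:10000", "hadoop")) produces a command string that can be pasted into a terminal, while the list form can be passed directly to subprocess.run without going through a shell.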

python+beeline+hql

How can code invoke HQL from the command line to run Hive database operations? A demo follows:

# coding=utf-8
import os
import sys
import logging
import shutil
import subprocess
import time
import tempfile


def get_user():
    logging.basicConfig(stream=sys.stdout,
                        level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')
    ts = str(int(time.time()))
    work_dir = os.path.join(tempfile.gettempdir(), 'user_ids')
    output_dir = os.path.join(work_dir, ts)
    if not os.path.exists(work_dir):
        os.mkdir(work_dir)
    try:
        logging.info("start to execute hive")
        get_user_from_dp(output_dir)  # export the rows from Hive
        tmp_result = load_data(output_dir)
        print(tmp_result)
        if os.path.exists(output_dir):
            shutil.rmtree(output_dir)
    except Exception as e:
        logging.error(e)


def get_user_from_dp(output_dir):
    # Only run the query if the output directory is missing or empty
    if not (os.path.isdir(output_dir) and os.listdir(output_dir)):
        # '\\t' reaches Hive as the two characters \t, i.e. a tab delimiter
        hql = '''insert overwrite local directory '{output_dir}' row format delimited fields terminated by '\\t' stored 
        as textfile select id, name from test_work.user_info;!q'''.format(output_dir=output_dir)
        hive_cmd(output_dir, hql)  # run the HQL script
    return output_dir


def hive_cmd(output_dir, cmd):
    # Write the HQL next to the output directory, e.g. /tmp/user_ids/balance_<ts>.hql
    file_name = 'balance_%s.hql' % os.path.basename(output_dir)
    file_path = os.path.join(os.path.dirname(os.path.abspath(output_dir)), file_name)
    with open(file_path, 'w') as f:
        f.write(cmd)
    # beeline command line pointing at the local HiveServer2
    server = '''HADOOP_CLIENT_OPTS="-Djline.terminal=jline.UnsupportedTerminal" /usr/local/hive/bin/beeline -u 'jdbc:hive2://localhost:10000' -n 'hadoop' -f {path} '''.format(path=file_path)
    try:
        os_cmd(server)
    finally:
        os.remove(file_path)  # remove the script even if beeline fails


def os_cmd(cmd):
    # Run through the shell with stderr merged into stdout
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT, universal_newlines=True)
    o, _ = p.communicate()
    s = p.returncode
    if s != 0:
        raise Exception('error code %s: %s msg: %s' % (s, cmd, o))
    return s, o


def load_data(output_dir):
    result = []
    separator = '\t'
    for file_name in os.listdir(output_dir):
        if file_name.startswith('.'):
            continue  # skip hidden checksum files, if any
        file_path = os.path.join(output_dir, file_name)
        with open(file_path, 'r') as f:
            for line in f:
                line = line.strip('\n')
                if not line.strip():  # skip blank lines
                    continue
                items = line.split(separator)
                if len(items) < 2:
                    continue
                user_id, name = items[0], items[1]
                result.append((user_id, name))
        os.remove(file_path)
    return result


if __name__ == '__main__':
    get_user()
    print('finish')

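In this example, the HQL statement is first written to the script file /tmp/user_ids/balance_xxx.hql (where xxx is a timestamp). beeline then connects to Hive from the command line, executes the HQL, and writes the result to the /tmp/user_ids/xxx directory. Finally, the script reads the query result back and processes it.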
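
The two paths described above, the output directory and the .hql script, are both temporary, so it can help to derive and clean them up in one place. Here is a minimal Python 3 sketch of that idea, assuming the same /tmp/user_ids layout; export_paths is a hypothetical helper, not part of the demo.

```python
import contextlib
import os
import shutil
import tempfile
import time


@contextlib.contextmanager
def export_paths(work_dir=None):
    """Yield (output_dir, hql_path) and delete both afterwards, even on error."""
    if work_dir is None:
        work_dir = os.path.join(tempfile.gettempdir(), "user_ids")
    os.makedirs(work_dir, exist_ok=True)
    ts = str(int(time.time()))
    # output_dir is not created here; the export query is expected to create it
    output_dir = os.path.join(work_dir, ts)
    hql_path = os.path.join(work_dir, "balance_%s.hql" % ts)
    try:
        yield output_dir, hql_path
    finally:
        shutil.rmtree(output_dir, ignore_errors=True)
        if os.path.exists(hql_path):
            os.remove(hql_path)
```

A caller would wrap the export and the read-back in a single with export_paths() as (output_dir, hql_path): block, so that a failed beeline run does not leave stale files behind.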

Mind the permissions on the /tmp/user_ids directory. When they are insufficient, the statement may fail with: Error: Error while compiling
statement: FAILED: IllegalStateException Cannot create staging
directory 'file:/tmp/fiction_ids/xxxx
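
One possible cause, stated here as an assumption rather than a diagnosis, is that HiveServer2 runs as a different OS user than the Python script and therefore cannot write into a directory the script created. On a single-machine test setup like the one in this article, the working directory could be opened up the same way /tmp itself is. The Python 3 sketch below uses a hypothetical helper name, make_shared_dir.

```python
import os
import stat


def make_shared_dir(path):
    """Create path if needed and make it writable by every user.

    Assumption: this is a throwaway test directory; mode 1777 mirrors /tmp
    (world-writable, with the sticky bit set).
    """
    os.makedirs(path, exist_ok=True)
    os.chmod(path, 0o1777)
    # Return the resulting permission bits so the caller can log or check them
    return stat.S_IMODE(os.stat(path).st_mode)
```

os.chmod only succeeds for the directory's owner (or root), so if /tmp/user_ids was created by another account, that account has to change the permissions instead.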
