Advanced Python (6): Hive/HBase/HDFS

0. Preface

1. Hive

1.1. Basic Information

  • Reference: dropbox/PyHive
  • PyHive talks to the hiveserver2 service, which listens on port 10000 by default.
  • Installation on Linux:
    • conda install thrift sasl pyhive
    • Note: installing directly with pip does not seem to work, because sasl fails to build.
  • Installation on Windows:

1.2. Basic Usage

  • Hive can be accessed through either the DB-API or SQLAlchemy.
  • Basic usage therefore amounts to ordinary DB-API or SQLAlchemy usage; see their respective documentation for details.
  • DB-API example
from pyhive import hive

# Connect to hiveserver2 (default port 10000)
conn = hive.Connection(host='10.8.13.120', port=10000, username='hdfs', database='default')
cursor = conn.cursor()
cursor.execute('show tables')

# Fetch and print every result row
for result in cursor.fetchall():
    print(result)
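  • Because pyhive exposes a standard DB-API connection, it plugs into anything that accepts one. A minimal sketch loading query results into a DataFrame (assuming pandas is installed; the table name my_table is a placeholder):
import pandas as pd

# The DB-API connection from above can be handed straight to pandas
df = pd.read_sql('SELECT * FROM my_table LIMIT 10', conn)
print(df.head())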
  • SQLAlchemy example
from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
# Presto
engine = create_engine('presto://localhost:8080/hive/default')
# Hive
engine = create_engine('hive://localhost:10000/default')
logs = Table('my_awesome_data', MetaData(bind=engine), autoload=True)
print(select([func.count('*')], from_obj=logs).scalar())
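  • The SQLAlchemy URL also accepts a username, so connecting to the same cluster as the DB-API example would look roughly like this (a sketch, not verified against every PyHive version):
# hive://<username>@<host>:<port>/<database>
engine = create_engine('hive://hdfs@10.8.13.120:10000/default')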

2. HBase

2.1. Basic Information

  • Installation:
pip install happybase
pip install thrift
  • Troubleshooting:
    • Error seen: thriftpy.parser.exc.ThriftParserError: ThriftPy does not support generating module with path in protocol 'd'
    • This error only occurs on Windows, where the drive letter in a file path (e.g. d:) is mistaken for a URL scheme.
    • See this article for details
    • Solution: modify the code in Lib\site-packages\thriftpy\parser\parser.py:
# Before the fix
url_scheme = urlparse(path).scheme
if url_scheme == '':
    with open(path) as fh:
        data = fh.read()
elif url_scheme in ('http', 'https'):
    data = urlopen(path).read()
else:
    raise ThriftParserError('ThriftPy does not support generating module '
                            'with path in protocol \'{}\''.format(
                                url_scheme))
                                
# After the fix
url_scheme = urlparse(path).scheme
if url_scheme == '':
    with open(path) as fh:
        data = fh.read()
elif url_scheme in ('c', 'd', 'e', 'f'):  # Windows drive letters c:, d:, e:, f:, etc.
    with open(path) as fh:
        data = fh.read()
elif url_scheme in ('http', 'https'):
    data = urlopen(path).read()
else:
    raise ThriftParserError('ThriftPy does not support generating module '
                            'with path in protocol \'{}\''.format(
                                url_scheme))
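  • Note that patching files inside site-packages is fragile: the change is silently lost on every reinstall or upgrade. Pinning the patched version, or checking whether a newer thriftpy/thriftpy2 release already handles Windows drive letters, may save trouble later.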

2.2. Basic Usage

  • Establish a connection
import happybase

# Connects to the HBase Thrift server (default port 9090)
connection = happybase.Connection(HOST_IP)
  • List the available tables
print(connection.tables())
  • Create a table
# DOCS: http://happybase.readthedocs.io/en/latest/api.html#happybase.Connection.create_table
# create_table(name, families)
# name (str) – The table name
# families (dict) – The name and options for each column family
families = {
    'cf1': dict(max_versions=10),
    'cf2': dict(max_versions=1, block_cache_enabled=False),
    'cf3': dict(),  # use defaults
}
connection.create_table('mytable', families)
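  • Dropping a table goes through the same connection; a minimal sketch using the table created above (delete_table raises unless the table is first disabled, hence disable=True):
# Disable, then delete, the table in one call
connection.delete_table('mytable', disable=True)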
  • Get table and row objects
# Table names do not need to be byte strings
table = connection.table('table_name')

# Row keys must be byte strings
# The returned value is a dict, e.g. {b'cf:col1': b'value1'}
row = table.row(b'row_key')
  • Basic operations
# Read data (row keys and column names are byte strings)
print(row[b'cf1:col1'])

# Store data (keys and values are byte strings)
# DOCS: http://happybase.readthedocs.io/en/latest/api.html#happybase.Table.put
table.put(b'row-key', {b'cf:col1': b'value1', b'cf:col2': b'value2'}, timestamp=123456789)
table.put(b'row-key', {b'cf:col1': b'value1'})

# Delete data
table.delete(b'row-key')
table.delete(b'row-key', columns=[b'cf1:col1', b'cf1:col2'])
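  • Reading a range of rows is done with table.scan(); a minimal sketch, assuming row keys prefixed with row as in the snippets above:
# Iterate over every row whose key starts with b'row'
for key, data in table.scan(row_prefix=b'row'):
    print(key, data)  # data is a dict like {b'cf:col1': b'value1'}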
  • Batch operations
# DOCS: http://happybase.readthedocs.io/en/latest/api.html#batch
b = table.batch()
b.put(b'row-key-1', {b'cf:col1': b'value1', b'cf:col2': b'value2'})
b.put(b'row-key-2', {b'cf:col2': b'value2', b'cf:col3': b'value3'})
b.put(b'row-key-3', {b'cf:col3': b'value3', b'cf:col4': b'value4'})
b.delete(b'row-key-4')
b.send()
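  • Per the HappyBase docs, a batch can also be used as a context manager, which calls send() automatically when the block exits:
# send() happens automatically at the end of the with block
with table.batch() as b:
    b.put(b'row-key-1', {b'cf:col1': b'value1'})
    b.delete(b'row-key-4')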
  • Connection pool
# DOCS: http://happybase.readthedocs.io/en/latest/api.html#connection-pool
pool = happybase.ConnectionPool(size=3, host='...')
# Keep the connection checked out as briefly as possible:
# fetch data inside the with block, process it outside
with pool.connection() as connection:
    table = connection.table('table-name')
    row = table.row(b'row-key')

process_data(row)

3. HDFS

3.1. Basic Information

3.2. Basic Usage

  • Create a client object
from hdfs.client import Client

# Connects to the NameNode over WebHDFS (default web port 50070)
client = Client("http://hdfs:50070/", root="/")
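  • To issue requests as a specific HDFS user, the library also offers InsecureClient (a sketch; host and user taken from the examples above):
from hdfs import InsecureClient

# Adds user=... to every WebHDFS request, so created files are owned by that user
client = InsecureClient("http://hdfs:50070", user="hdfs")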
  • Other basic operations
# Create a directory
client.makedirs("/test", permission=777)

# List the files under a directory
# status: when True, also return each entry's status information; defaults to False
client.list(hdfs_path, status=False)

# Rename/move a file
client.rename(hdfs_src_path, hdfs_dst_path)

# Write data
# Whether the file is appended to or overwritten is controlled by the overwrite/append flags
client.write(hdfs_path, data, overwrite=True, append=False)

# Download a file from HDFS to the local filesystem
client.download(hdfs_path, local_path, overwrite=False)

# Upload a local file to HDFS
client.upload(hdfs_path, local_path, cleanup=True)

# Delete a file from HDFS
client.delete(hdfs_path)

# Read a file
with client.read('foo') as reader:
    content = reader.read()
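  • read() also accepts encoding and delimiter arguments, which makes streaming a text file line by line straightforward (a sketch assuming foo is UTF-8 text; delimiter requires encoding to be set):
# Yield the file line by line instead of loading it all into memory
with client.read('foo', encoding='utf-8', delimiter='\n') as reader:
    for line in reader:
        print(line)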