Advanced Python (6): Hive/HBase/HDFS

0. Preface

1. Hive

1.1. Basic Information

  • Reference: dropbox/PyHive
  • PyHive talks to the hiveserver2 service, which listens on port 10000 by default.
  • Installation on Linux:
    • conda install thrift sasl pyhive
    • Note: installing directly with pip does not seem to work, because sasl fails to build.
  • Installation on Windows:

1.2. Basic Usage

  • Hive can be accessed through either the DB-API or SQLAlchemy.
  • Basic usage therefore amounts to ordinary DB-API or SQLAlchemy usage; see their respective documentation for details.
  • DB-API example
from pyhive import hive

# Connect to hiveserver2 (default port 10000)
conn = hive.Connection(host='10.8.13.120', port=10000, username='hdfs', database='default')
cursor = conn.cursor()
cursor.execute('show tables')

# Fetch and print every result row
for result in cursor.fetchall():
    print(result)
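  • Because pyhive exposes a standard DB-API connection, it plugs into anything that accepts one. A minimal sketch loading query results into a DataFrame (assuming pandas is installed; the table name my_table is a placeholder):
import pandas as pd

# The DB-API connection from above can be handed straight to pandas
df = pd.read_sql('SELECT * FROM my_table LIMIT 10', conn)
print(df.head())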
  • SQLAlchemy example
from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
# Presto
engine = create_engine('presto://localhost:8080/hive/default')
# Hive
engine = create_engine('hive://localhost:10000/default')
logs = Table('my_awesome_data', MetaData(bind=engine), autoload=True)
print(select([func.count('*')], from_obj=logs).scalar())
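  • The SQLAlchemy URL also accepts a username, so connecting to the same cluster as the DB-API example would look roughly like this (a sketch, not verified against every PyHive version):
# hive://<username>@<host>:<port>/<database>
engine = create_engine('hive://hdfs@10.8.13.120:10000/default')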

2. HBase

2.1. Basic Information

  • Installation:
pip install happybase
pip install thrift
  • Troubleshooting:
    • Error seen: thriftpy.parser.exc.ThriftParserError: ThriftPy does not support generating module with path in protocol 'd'
    • This error only occurs on Windows, where the drive letter in a file path (e.g. d:) is mistaken for a URL scheme.
    • See this article for details
    • Solution: modify the code in Lib\site-packages\thriftpy\parser\parser.py:
# Before the fix
url_scheme = urlparse(path).scheme
if url_scheme == '':
    with open(path) as fh:
        data = fh.read()
elif url_scheme in ('http', 'https'):
    data = urlopen(path).read()
else:
    raise ThriftParserError('ThriftPy does not support generating module '
                            'with path in protocol \'{}\''.format(
                                url_scheme))
                                
# After the fix
url_scheme = urlparse(path).scheme
if url_scheme == '':
    with open(path) as fh:
        data = fh.read()
elif url_scheme in ('c', 'd', 'e', 'f'):  # Windows drive letters c:, d:, e:, f:, etc.
    with open(path) as fh:
        data = fh.read()
elif url_scheme in ('http', 'https'):
    data = urlopen(path).read()
else:
    raise ThriftParserError('ThriftPy does not support generating module '
                            'with path in protocol \'{}\''.format(
                                url_scheme))
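  • Note that patching files inside site-packages is fragile: the change is silently lost on every reinstall or upgrade. Pinning the patched version, or checking whether a newer thriftpy/thriftpy2 release already handles Windows drive letters, may save trouble later.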

2.2. Basic Usage

  • Establish a connection
import happybase

# Connects to the HBase Thrift server (default port 9090)
connection = happybase.Connection(HOST_IP)
  • List the available tables
print(connection.tables())
  • Create a table
# DOCS: http://happybase.readthedocs.io/en/latest/api.html#happybase.Connection.create_table
# create_table(name, families)
# name (str) – The table name
# families (dict) – The name and options for each column family
families = {
    'cf1': dict(max_versions=10),
    'cf2': dict(max_versions=1, block_cache_enabled=False),
    'cf3': dict(),  # use defaults
}
connection.create_table('mytable', families)
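  • Dropping a table goes through the same connection; a minimal sketch using the table created above (delete_table raises unless the table is first disabled, hence disable=True):
# Disable, then delete, the table in one call
connection.delete_table('mytable', disable=True)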
  • Get table and row objects
# Table names do not need to be byte strings
table = connection.table('table_name')

# Row keys must be byte strings
# The returned value is a dict, e.g. {b'cf:col1': b'value1'}
row = table.row(b'row_key')
  • Basic operations
# Read data (row keys and column names are byte strings)
print(row[b'cf1:col1'])

# Store data (keys and values are byte strings)
# DOCS: http://happybase.readthedocs.io/en/latest/api.html#happybase.Table.put
table.put(b'row-key', {b'cf:col1': b'value1', b'cf:col2': b'value2'}, timestamp=123456789)
table.put(b'row-key', {b'cf:col1': b'value1'})

# Delete data
table.delete(b'row-key')
table.delete(b'row-key', columns=[b'cf1:col1', b'cf1:col2'])
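  • Reading a range of rows is done with table.scan(); a minimal sketch, assuming row keys prefixed with row as in the snippets above:
# Iterate over every row whose key starts with b'row'
for key, data in table.scan(row_prefix=b'row'):
    print(key, data)  # data is a dict like {b'cf:col1': b'value1'}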
  • Batch operations
# DOCS: http://happybase.readthedocs.io/en/latest/api.html#batch
b = table.batch()
b.put(b'row-key-1', {b'cf:col1': b'value1', b'cf:col2': b'value2'})
b.put(b'row-key-2', {b'cf:col2': b'value2', b'cf:col3': b'value3'})
b.put(b'row-key-3', {b'cf:col3': b'value3', b'cf:col4': b'value4'})
b.delete(b'row-key-4')
b.send()
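  • Per the HappyBase docs, a batch can also be used as a context manager, which calls send() automatically when the block exits:
# send() happens automatically at the end of the with block
with table.batch() as b:
    b.put(b'row-key-1', {b'cf:col1': b'value1'})
    b.delete(b'row-key-4')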
  • Connection pool
# DOCS: http://happybase.readthedocs.io/en/latest/api.html#connection-pool
pool = happybase.ConnectionPool(size=3, host='...')
# Keep the connection checked out as briefly as possible:
# fetch data inside the with block, process it outside
with pool.connection() as connection:
    table = connection.table('table-name')
    row = table.row(b'row-key')

process_data(row)

3. HDFS

3.1. Basic Information

3.2. Basic Usage

  • Create a client object
from hdfs.client import Client

# Connects to the NameNode over WebHDFS (default web port 50070)
client = Client("http://hdfs:50070/", root="/")
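  • To issue requests as a specific HDFS user, the library also offers InsecureClient (a sketch; host and user taken from the examples above):
from hdfs import InsecureClient

# Adds user=... to every WebHDFS request, so created files are owned by that user
client = InsecureClient("http://hdfs:50070", user="hdfs")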
  • Other basic operations
# Create a directory
client.makedirs("/test", permission=777)

# List the files under a directory
# status: when True, also return each entry's status information; defaults to False
client.list(hdfs_path, status=False)

# Rename/move a file
client.rename(hdfs_src_path, hdfs_dst_path)

# Write data
# Whether the file is appended to or overwritten is controlled by the overwrite/append flags
client.write(hdfs_path, data, overwrite=True, append=False)

# Download a file from HDFS to the local filesystem
client.download(hdfs_path, local_path, overwrite=False)

# Upload a local file to HDFS
client.upload(hdfs_path, local_path, cleanup=True)

# Delete a file from HDFS
client.delete(hdfs_path)

# Read a file
with client.read('foo') as reader:
    content = reader.read()
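  • read() also accepts encoding and delimiter arguments, which makes streaming a text file line by line straightforward (a sketch assuming foo is UTF-8 text; delimiter requires encoding to be set):
# Yield the file line by line instead of loading it all into memory
with client.read('foo', encoding='utf-8', delimiter='\n') as reader:
    for line in reader:
        print(line)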