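Code downloaded from GitHub often ships its data in files with the .h5 extension. So what kind of file is this?
A look at the code shows that it imports the h5py package, so let's see how to read this kind of file!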
An HDF5 file is a container for two kinds of objects: datasets, which are array-like collections of data, and groups, which are folder-like containers that hold datasets and other groups.
Groups work like dictionaries, and datasets work like NumPy arrays.
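To make the two analogies concrete, here is a minimal sketch. The file name `demo_objects.h5` and the names `measurements` and `temps` are made up for illustration, and the file is written to a temporary directory so nothing of yours gets overwritten.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name, placed in a temporary directory for illustration.
path = os.path.join(tempfile.mkdtemp(), "demo_objects.h5")

with h5py.File(path, "w") as f:
    # Groups behave like dictionaries: members are stored and looked up by name.
    grp = f.create_group("measurements")
    grp["temps"] = np.arange(10)

    member_names = list(grp.keys())   # names of the group's members
    has_temps = "temps" in grp        # dictionary-style membership test

    # Datasets behave like NumPy arrays: they have a shape, a dtype, and slicing.
    dset = grp["temps"]
    dset_shape = dset.shape
    first_three = dset[:3]            # slicing reads the data into a NumPy array

print(member_names, has_temps, dset_shape, first_three)
```

Everything that touches the file happens inside the with block; the plain Python values and NumPy arrays collected there stay usable after the file is closed.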
Suppose someone hands you the file mytestfile.hdf5. How do you read it? See the code below:
# Import the package
import h5py
# Open the file; 'r' means read-only
f = h5py.File('mytestfile.hdf5', 'r')
# List the keys in the file
list(f.keys()) # ['mydataset']
# Remember h5py.File acts like a Python dictionary, thus we can check the keys
# Retrieve the dataset stored under that key
data = f['mydataset']
# This object is an HDF5 dataset; dtype and shape give its element type and size
data.shape
data.dtype
# Datasets also support array-style slicing
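# The file was opened read-only, so nothing can be assigned to data here;
# the slice below assumes the dataset already holds np.arange(100).
data[0:100:10] # array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
# Close the file when you are done with it
f.close()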
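The same read can also be written with a with block, which closes the file for you. The sketch below first builds a stand-in for the file you were given (its contents, np.arange(100) under the key `mydataset`, are an assumption made so the example can run on its own) and then reads it back.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a stand-in for the received file (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), "mytestfile.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("mydataset", data=np.arange(100))

# Open read-only; the file is closed automatically when the block ends.
with h5py.File(path, "r") as f:
    names = list(f.keys())
    data = f["mydataset"]
    every_tenth = data[0:100:10]   # reads just this selection as a NumPy array
    full_copy = data[()]           # reads the whole dataset into memory

# The NumPy arrays remain usable after the file is closed.
print(names)
print(every_tenth)
print(full_copy.shape)
```

Note that `data` itself is tied to the open file, so copy out what you need (as `every_tenth` and `full_copy` do here) before leaving the with block.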
So, can you create an HDF5 file yourself?
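import h5py
# Create a file; "w" means write (an existing file with the same name is overwritten)
f = h5py.File("myh5py.hdf5", "w")
# Create a dataset holding 100 integers
dataset = f.create_dataset("d1", (100,), dtype="i")
# A file can also be created as follows, which is safer because the file is closed automatically when the with block ends.
# Separate names (f2, dataset2) are used so that f above stays open for the steps below.
with h5py.File("mytestfile.hdf5", "w") as f2:
    dataset2 = f2.create_dataset("d1", (100,), dtype="i")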
# Now you can inspect your file!
for key in f.keys():
    print(key)
    print(f[key].name)
    print(f[key].shape)
    print(f[key][()])  # reads the whole dataset; the older .value attribute was removed in h5py 3.0
# Output:
# d1
# /d1
# (100,)
# [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
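# Assign values to the dataset from a NumPy array
import numpy as np
dataset[...] = np.arange(100)
# Add a new dataset by assigning an array directly to a new key on the file
f["lala"] = np.arange(100)
# Check the result
for key in f.keys():
    print(f[key].name)
    print(f[key][()])  # [()] replaces .value, which was removed in h5py 3.0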
# /d1
# [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
# 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
# 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
# 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
# 96 97 98 99]
# /lala
# [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
# 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
# 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
# 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
# 96 97 98 99]
# Create a dataset directly from an existing NumPy array
arr = np.arange(100)
dset = f.create_dataset("dset", data=arr)
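To check that what you wrote is really on disk, you can close the file and open it again read-only. A minimal round-trip sketch, assuming a throwaway file named `roundtrip.hdf5` (a made-up name) in a temporary directory:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "roundtrip.hdf5")

arr = np.arange(100)
with h5py.File(path, "w") as f:
    # With data=..., the dataset's shape and dtype are taken from the array.
    f.create_dataset("dset", data=arr)

# Reopen in read-only mode and compare with the original array.
with h5py.File(path, "r") as f:
    stored = f["dset"][()]
    stored_shape = f["dset"].shape

same = np.array_equal(stored, arr)
print(stored_shape, same)
```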
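Groups are the container mechanism by which HDF5 files are organized. From a Python perspective, they operate somewhat like dictionaries (key-value). In this case the "keys" are the names of the group members, and the "values" are the members themselves (Group and Dataset objects). [The result looks a bit like a folder: compared with the earlier datasets, each name gains a parent directory level.]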
Note! When using h5py from Python 3, the keys(), values() and items() methods will return view-like objects instead of lists. These objects support membership testing and iteration, but can’t be sliced like lists.
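A small sketch of what that note means in practice. The file name `views_demo.hdf5` and the keys `a`, `b`, `c` are invented for the example:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "views_demo.hdf5")

with h5py.File(path, "w") as f:
    f["a"] = np.arange(3)
    f["b"] = np.arange(3)
    f["c"] = np.arange(3)

    keys_view = f.keys()
    has_b = "b" in keys_view            # membership testing works on the view
    names = [k for k in keys_view]      # so does iteration

    # Indexing or slicing the view directly is not supported.
    try:
        keys_view[0]
        sliceable = True
    except TypeError:
        sliceable = False

    first_two = list(keys_view)[:2]     # convert to a list first, then slice

print(has_b, names, sliceable, first_two)
```

This is why the earlier snippets wrap the call as list(f.keys()) when they want something that prints and indexes like a normal list.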
import h5py
import numpy as np
# Create a file in write mode
f = h5py.File("myh5py.hdf5", "w")
# Create a group named bar
g = f.create_group("bar")
# Inside the group bar, create datasets named dset1 and dset2 and assign their values
g["dset1"] = np.arange(100)
g["dset2"] = np.arange(100).reshape((10, 10))
for key in g.keys():
    print(g[key].name)
    print(g[key][()])  # [()] replaces .value, which was removed in h5py 3.0
# /bar/dset1
# [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
# 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
# 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
# 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
# 96 97 98 99]
# /bar/dset2
# [[ 0 1 2 3 4 5 6 7 8 9]
# [10 11 12 13 14 15 16 17 18 19]
# [20 21 22 23 24 25 26 27 28 29]
# [30 31 32 33 34 35 36 37 38 39]
# [40 41 42 43 44 45 46 47 48 49]
# [50 51 52 53 54 55 56 57 58 59]
# [60 61 62 63 64 65 66 67 68 69]
# [70 71 72 73 74 75 76 77 78 79]
# [80 81 82 83 84 85 86 87 88 89]
# [90 91 92 93 94 95 96 97 98 99]]
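Because group members get path-like names such as /bar/dset1, you can also address them by path and walk the whole tree. The sketch below is one way to do that with `visititems`; the file name `groups_demo.hdf5`, the nested group `sub`, the dataset `dset3`, and the helper `record` are all made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "groups_demo.hdf5")

with h5py.File(path, "w") as f:
    g = f.create_group("bar")
    g["dset1"] = np.arange(100)
    # Giving a path creates the intermediate group "sub" automatically.
    f.create_dataset("bar/sub/dset3", data=np.zeros((2, 2)))

    found = []

    def record(name, obj):
        # Called once per object below f; name is relative to f.
        kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
        found.append((name, kind))

    f.visititems(record)

    # A member can be reached directly by its full path.
    direct_shape = f["/bar/sub/dset3"].shape

for name, kind in found:
    print(kind, name)
print(direct_shape)
```

Walking the file like this is handy when you receive an unfamiliar .h5 file and want to see everything it contains before reading any data.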
How do you read the data in a Group?
# Open an h5 file whose top level holds a Group (file_path is the path to your own file)
f1 = h5py.File(file_path, 'r')
print(f1.keys()) # Output: <KeysViewHDF5 ['df']>
# f1 has only one key, so take out the value stored under it
group = f1['df']
# Check what kind of object that value is
print(group.name, type(group), group.keys(), group.values())
# Output: name:"/df";type="<class 'h5py._hl.group.Group'>";keys="<KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values']>";values="ValuesViewHDF5(<HDF5 group "/df" (4 members)>)"
# With a rough idea of the data, extract it
data = []  # list that collects the array of each member
for key in group.keys():
    print(len(group[key][()])) # 207 34272 207 34272
    print(group[key][()].shape) # you can also print its shape
    data.append(group[key][()]) # collect the data in the list
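If you would rather keep each array together with its name, a dictionary is a natural fit. The sketch below builds a small stand-in file first so it can run on its own: the member names are taken from the output above, but the array sizes and contents are invented. The names resemble what pandas writes when it saves a DataFrame to HDF5, though that is only a guess about this particular file.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in file with the same member names as above (sizes are invented).
path = os.path.join(tempfile.mkdtemp(), "group_demo.h5")
with h5py.File(path, "w") as f:
    g = f.create_group("df")
    g["axis0"] = np.arange(3)
    g["axis1"] = np.arange(5)
    g["block0_items"] = np.arange(3)
    g["block0_values"] = np.zeros((5, 3))

# Read every member of the group into a dict keyed by member name.
with h5py.File(path, "r") as f:
    group = f["df"]
    arrays = {key: group[key][()] for key in group.keys()}

shapes = {key: value.shape for key, value in arrays.items()}
print(shapes)
```

Compared with appending to a list, the dictionary keeps the link between a name like block0_values and its array, so you do not have to rely on the order of the keys.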