Wikipedia Dataset Preprocessing

Notes

The Wikipedia dataset [1] is a retrieval dataset with 2866 samples in 10 classes and two modalities: images and text.
The goal is to process the data following the setup of [2], which in turn appears to come from [3]: image features are the 4096-d activations of the fc7 layer [5] of CaffeNet [4], and text features are the average of the 100-d word2vec [6] vectors of each word.
For now, Keras's pretrained VGG16 [7,8] stands in for CaffeNet (see [12]); the word2vec features are produced with the gensim library [9] (see [13, 14]).

Data

Download from [10]. After unpacking there are two files, trainset_txt_img_cat.list and testset_txt_img_cat.list; each line is one sample, with 3 columns: text file name, image file name, class id.
The text data live under texts/, wrapped in .xml files. I originally wanted to parse them with minidom [11], but it fails on some stray characters (e.g. a bare &); lacking a better fix, I parse them by hand for now. A possible workaround is sketched below.
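One alternative (a sketch only; parse_xml_text is a hypothetical helper and the entity list may be incomplete) is to escape bare & characters first and then hand the result to a standard parser such as ElementTree:

import re
import xml.etree.ElementTree as ET

def parse_xml_text(fn):
    """Escape bare '&' so the file becomes well-formed XML, then parse it."""
    with open(fn, "r", encoding="utf-8") as f:
        raw = f.read()
    # a '&' that does not start a known entity/character reference becomes '&amp;'
    raw = re.sub(r"&(?!amp;|lt;|gt;|quot;|apos;|#)", "&amp;", raw)
    node = ET.fromstring(raw).find(".//text")
    return node.text if node is not None else ""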
The image data live under images/, split into one folder per class.

Code

import os
import shutil  # used below for cross-platform file copying
from os.path import join
import numpy as np

from gensim.models import Word2Vec

from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model


P = "wikipedia_dataset"
IMG_P = "images"
TXT_P = "texts"
TRAIN_LIST = "trainset_txt_img_cat.list"
TEST_LIST = "testset_txt_img_cat.list"

os.chdir(P)  # move into the unpacked dataset directory
print(os.getcwd())

sample list

Read the sample lists first so that images, texts, and labels are all processed in the same order.

ls_img = []
ls_txt = []
ls_lab = []

for fname in (TRAIN_LIST, TEST_LIST):
    with open(fname, "r") as f:
        for line in f:
            txt_f, img_f, lab = line.split()
            # full paths ("{}.xml" / "{}.jpg") are built later, when each file is read
            ls_img.append(img_f)
            ls_txt.append(txt_f)
            ls_lab.append(int(lab))

print(len(ls_img), len(ls_txt), len(ls_lab))  # 2866 2866 2866
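Since the two lists are concatenated train-first, it may help to record the split boundary so the saved features can be re-split later (a minimal sketch):

with open(TRAIN_LIST, "r") as f:
    N_TRAIN = sum(1 for _ in f)  # number of training samples; test samples follow
print(N_TRAIN, len(ls_lab) - N_TRAIN)  # train/test sizes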

labels

Convert the labels to one-hot and save them.

labels = np.asarray(ls_lab)
print(labels.shape, np.max(labels), np.min(labels))  # (2866,) 10 1
N_CLASS = np.max(labels)
labels -= 1  # shift to [0, N_CLASS - 1]
labels = np.eye(N_CLASS)[labels]  # to one-hot
print(labels.shape)  # (2866, 10)
np.save("labels.npy", labels)

texts

Parse the .xml files by hand and strip some extraneous punctuation.

def parse(fn):
    """Manually parse the xml: keep only the lines between <text> and </text>"""
    res = ""
    flag = False
    with open(fn, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line == "</text>":
                break
            if flag:
                res += " " + line
            if line == "<text>":
                flag = True
    return res


def clean(strings, pattern):
    """Strip `pattern` from every string in `strings`"""
    return [s.replace(pattern, "") for s in strings]


"""解析 xml"""
sentences = []
for txt_f in ls_txt:
    txt_f = join(TXT_P, "{}.xml".format(txt_f))
    # print(txt_f)
    doc = parse(txt_f)  # 手動解析
    # doc = minidom.parse(txt_f).documentElement.getElementsByTagName("text")[0].childNodes[0].data
    words = doc.split()
    # 清除多餘符號
    for pat in (",", ".", "!", "?", "''", "(", ")", "\"", ":", ";", "{", "}", "[", "]"):
        words = clean(words, pat)
    sentences.append(words)

print(len(sentences))


"""訓練 word2vec 模型"""
# [3] 說用 skip-gram
w2v = Word2Vec(sentences, size=100, min_count=5, iter=50, sg=1)  # sg = skip-gram
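Note that gensim 4.0 renamed these constructor arguments; on a newer gensim the equivalent call would be:

# gensim >= 4.0: `size` became `vector_size`, `iter` became `epochs`
# w2v = Word2Vec(sentences, vector_size=100, min_count=5, epochs=50, sg=1)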


"""提取文本特徵"""
texts = np.zeros([len(sentences), 100])
for i, s in enumerate(sentences):
    cnt = 0
    for w in s:
        if w in w2v:
            cnt += 1
            texts[i] += w2v[w]
    # 取平均詞向量
    texts[i] /= cnt

# save
np.save("texts.w2v.100.npy", texts)

images

Copy all the images into a single directory for convenience, then extract features with VGG16.

ALL_IMG_P = "images_all"
if not os.path.exists(ALL_IMG_P):
    os.makedirs(ALL_IMG_P)


"""全複製到 ALL_IMG_P"""
for cls in os.listdir(IMG_P):
    cls_d = join(IMG_P, cls)
    # print(os.listdir(cls_d))
    for img in os.listdir(cls_d):
        # os.system("cp {} {}".format(join(cls_d, img), ALL_IMG_P))  # linux
        os.system("copy {} {}".format(join(cls_d, img), ALL_IMG_P))  # windows
print(len(os.listdir(ALL_IMG_P)))


"""提特徵"""
base_model = VGG16(weights='imagenet')
# print(base_model.summary())
model = Model(inputs=base_model.input, outputs=base_model.get_layer('fc2').output)
# print(model.summary())

images = []
for i_name in ls_img:
    img_f = join(ALL_IMG_P, "{}.jpg".format(i_name))
    img = image.load_img(img_f, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    images.append(model.predict(x))

images = np.vstack(images)
print(images.shape)

# save
np.save("images.vgg16.npy", images)

Processed Data

Both the raw data and the processed features are available on Baidu Netdisk.
Link: https://pan.baidu.com/s/19pjYO5Uxsq2aiGFqofp-CQ, extraction code: gr9m

References

  1. A new approach to cross-modal multimedia retrieval
  2. Semi-Supervised Cross-Modal Retrieval with Label Prediction
  3. Generalized Semi-supervised and Structured Subspace Learning for Cross-Modal Retrieval
  4. Caffe: Convolutional Architecture for Fast Feature Embedding
  5. caffe/models/bvlc_reference_caffenet/train_val.prototxt
  6. Distributed representations of words and phrases and their compositionality
  7. Very Deep Convolutional Networks for Large-Scale Image Recognition
  8. VGG16
  9. gensim
  10. Cross-Modal Multimedia Retrieval
  11. xml.dom.minidom
  12. keras預訓練模型應用(3):VGG19提取任意層特徵 (applying Keras pretrained models (3): extracting features from any VGG19 layer)
  13. 基於 Gensim 的 Word2Vec 實踐 (Word2Vec practice with gensim)
  14. 用gensim學習word2vec (learning word2vec with gensim)