This article is shared from the Huawei Cloud community post "CNN-VIT Video Dynamic Gesture Recognition [Playing with Huawei Cloud]", by HouYanSong.
CNN-VIT Video Dynamic Gesture Recognition
Artificial intelligence is advancing rapidly and has had a profound impact on human-computer interaction. Hand gestures, as a natural and efficient way to interact, are widely used in areas such as intelligent driving and virtual reality. The task of gesture recognition is to let the computer quickly and accurately determine which gesture the operator has performed. This article uses ModelArts to develop and train a video dynamic gesture recognition model that detects dynamic gesture classes such as swipe up, swipe down, swipe left, swipe right, open, and close, implementing functionality similar to the air gestures on Huawei phones.
Algorithm Overview
The CNN-VIT video dynamic gesture recognition algorithm first uses the pre-trained network InceptionResNetV2 to extract features from the video clip frame by frame, then feeds the feature sequence into a Transformer Encoder for classification. We test the algorithm on a sample dynamic gesture recognition dataset of 108 videos covering 7 gesture classes: invalid gesture, swipe up, swipe down, swipe left, swipe right, open, and close. The overall workflow is as follows:
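For reference, the snippets below assume a handful of global imports and constants. NUM_FEATURES = 1536 matches the output dimension of InceptionResNetV2 with average pooling, and MAX_SEQUENCE_LENGTH = 40 matches the padding length described later; the IMG_SIZE value and the exact import list are reasonable assumptions rather than values taken from the original notebook:

# Assumed global setup for the snippets below (a sketch, not the original notebook cell)
import cv2
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tqdm import tqdm

IMG_SIZE = 299                 # input resolution fed to InceptionResNetV2 (assumed value)
MAX_SEQUENCE_LENGTH = 40       # number of sampled frames kept per video
NUM_FEATURES = 1536            # feature dimension of InceptionResNetV2 with avg pooling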
First, we decode the captured video files and extract key frames, keeping one frame out of every 4, then center-crop and preprocess each image. The code is as follows:
def load_video(file_name):
    cap = cv2.VideoCapture(file_name)
    # Sample one frame every N frames
    frame_interval = 4
    frames = []
    count = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # Keep one frame every frame_interval frames
        if count % frame_interval == 0:
            # Center crop
            frame = crop_center_square(frame)
            # Resize
            frame = cv2.resize(frame, (IMG_SIZE, IMG_SIZE))
            # BGR -> RGB: channel order [0, 1, 2] -> [2, 1, 0]
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)
        count += 1
    cap.release()
    return np.array(frames)
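load_video calls a crop_center_square helper that is not shown in this article. A minimal sketch of such a helper, which simply cuts the largest centered square out of each frame, could look like this:

# Hypothetical helper: crop the largest centered square from a frame
def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x - min_dim) // 2
    start_y = (y - min_dim) // 2
    return frame[start_y:start_y + min_dim, start_x:start_x + min_dim]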
Next, we create the image feature extractor, using the pre-trained InceptionResNetV2 model to extract image features. The code is as follows:
def get_feature_extractor():
    feature_extractor = keras.applications.inception_resnet_v2.InceptionResNetV2(
        weights='imagenet',
        include_top=False,
        pooling='avg',
        input_shape=(IMG_SIZE, IMG_SIZE, 3)
    )

    preprocess_input = keras.applications.inception_resnet_v2.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)
    outputs = feature_extractor(preprocessed)

    model = keras.Model(inputs, outputs, name='feature_extractor')
    return model
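The extractor is built once and reused for every video. A quick sanity check (the shape shown is what this configuration should produce, assuming the IMG_SIZE defined above) might look like:

# Build the extractor once and verify the output feature dimension
feature_extractor = get_feature_extractor()
dummy_batch = np.zeros((1, IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)
print(feature_extractor.predict(dummy_batch).shape)   # expected: (1, 1536)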
We then extract the feature vectors for each video; if a video has fewer than 40 frames, we pad it with all-zero arrays:
def load_data(videos, labels):
    video_features = []

    for video in tqdm(videos):
        frames = load_video(video)
        counts = len(frames)
        # If the number of frames is less than MAX_SEQUENCE_LENGTH
        if counts < MAX_SEQUENCE_LENGTH:
            # Pad the sequence
            diff = MAX_SEQUENCE_LENGTH - counts
            # Create an all-zero numpy array
            padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3))
            # Concatenate the arrays
            frames = np.concatenate((frames, padding))
        # Keep only the first MAX_SEQUENCE_LENGTH frames
        frames = frames[:MAX_SEQUENCE_LENGTH, :]
        # Extract features for the whole frame batch
        video_feature = feature_extractor.predict(frames)
        video_features.append(video_feature)

    return np.array(video_features), np.array(labels)
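The videos and labels arguments are parallel lists of file paths and integer class indices. Judging from the sample file name shown later (hand_gesture/woman_014_0_7.mp4, where the second-to-last underscore field is the class index), one plausible way to build them is the sketch below; the directory name and naming convention are assumptions, not confirmed by the original article:

import glob

# Collect video paths and derive the integer label from the file name (assumed convention)
videos = glob.glob('hand_gesture/*.mp4')
labels = [int(path.split('_')[-2]) for path in videos]

# Pre-compute the per-frame features for every video
video_features, classes = load_data(videos, labels)
print(video_features.shape)   # expected: (num_videos, MAX_SEQUENCE_LENGTH, NUM_FEATURES)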
Finally, we create the VIT model. The code is as follows:
# Positional encoding
class PositionalEmbedding(layers.Layer):
    def __init__(self, seq_length, output_dim):
        super().__init__()
        # Build the position indices 0 ~ seq_length - 1
        self.positions = tf.range(start=0, limit=seq_length, delta=1)
        self.positional_embedding = layers.Embedding(input_dim=seq_length, output_dim=output_dim)

    def call(self, x):
        # Look up the positional encodings
        positions_embedding = self.positional_embedding(self.positions)
        # Add them to the input
        return x + positions_embedding

# Transformer encoder
class TransformerEncoder(layers.Layer):
    def __init__(self, num_heads, embed_dim):
        super().__init__()
        self.p_embedding = PositionalEmbedding(MAX_SEQUENCE_LENGTH, NUM_FEATURES)
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim, dropout=0.1)
        self.layernorm = layers.LayerNormalization()

    def call(self, x):
        # Positional embedding
        positional_embedding = self.p_embedding(x)
        # Self-attention
        attention_out = self.attention(
            query=positional_embedding,
            value=positional_embedding,
            key=positional_embedding,
            attention_mask=None
        )
        # Layer norm with residual connection
        output = self.layernorm(positional_embedding + attention_out)
        return output

def video_cls_model(class_vocab):
    # Number of classes
    classes_num = len(class_vocab)
    # Define the model
    model = keras.Sequential([
        layers.InputLayer(input_shape=(MAX_SEQUENCE_LENGTH, NUM_FEATURES)),
        TransformerEncoder(2, NUM_FEATURES),
        layers.GlobalMaxPooling1D(),
        layers.Dropout(0.1),
        layers.Dense(classes_num, activation="softmax")
    ])
    # Compile the model
    model.compile(
        optimizer=keras.optimizers.Adam(1e-5),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
        metrics=['accuracy']
    )
    return model
Model Training
For the full walkthrough, click Run in ModelArts to run the Notebook I published with one click:
The final model reaches 87% accuracy on the whole dataset, which is a fairly good result for training on such a small dataset.
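The published Notebook contains the full training code. As a rough sketch of what that training step looks like, assuming the features prepared by load_data above and a simple train/test split (the class_vocab wording, split ratio, and epoch count below are illustrative assumptions; the 'saved_model' path matches what the inference section loads):

from sklearn.model_selection import train_test_split

# Hypothetical training sketch; hyperparameters are illustrative only
class_vocab = ['Invalid gesture', 'Swipe up', 'Swipe down', 'Swipe left',
               'Swipe right', 'Open', 'Close']
model = video_cls_model(class_vocab)

x_train, x_test, y_train, y_test = train_test_split(
    video_features, classes, test_size=0.2, random_state=42)

model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=100)
model.evaluate(video_features, classes)   # accuracy on the whole dataset
model.save('saved_model')                 # SavedModel used later for inference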
Video Inference
First, load the VIT model and get the index-to-label mapping for the video classes:
import random
import imageio
from IPython.display import display, Image

# Load the trained model
model = tf.keras.models.load_model('saved_model')

# Class index to label name
label_to_name = {0: 'Invalid gesture', 1: 'Swipe up', 2: 'Swipe down', 3: 'Swipe left', 4: 'Swipe right',
                 5: 'Open', 6: 'Close', 7: 'Zoom in', 8: 'Zoom out'}
Then use the InceptionResNetV2 image feature extractor to extract the video features:
# Extract features for a whole video
def getVideoFeat(frames):
    frames_count = len(frames)
    # If the number of frames is less than MAX_SEQUENCE_LENGTH
    if frames_count < MAX_SEQUENCE_LENGTH:
        # Pad the sequence
        diff = MAX_SEQUENCE_LENGTH - frames_count
        # Create an all-zero numpy array
        padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3))
        # Concatenate the arrays
        frames = np.concatenate((frames, padding))

    # Keep the first MAX_SEQUENCE_LENGTH frames
    frames = frames[:MAX_SEQUENCE_LENGTH, :]
    # Compute the video features, shape (N, 1536)
    video_feat = feature_extractor.predict(frames)
    return video_feat
Finally, feed the video sequence's feature vectors into the Transformer Encoder for prediction:
# Predict on a random video
def testVideo():
    test_file = random.sample(videos, 1)[0]
    label = test_file.split('_')[-2]

    print('File name: {}'.format(test_file))
    print('Ground-truth class: {}'.format(label_to_name.get(int(label))))

    # Read every sampled frame of the video
    frames = load_video(test_file)
    # Keep the first MAX_SEQUENCE_LENGTH frames for display
    frames = frames[:MAX_SEQUENCE_LENGTH].astype(np.uint8)
    # Save the frames as a GIF
    imageio.mimsave('animation.gif', frames, duration=10)
    # Extract the video features
    feat = getVideoFeat(frames)
    # Run model inference
    prob = model.predict(tf.expand_dims(feat, axis=0))[0]

    print('Predicted classes:')
    for i in np.argsort(prob)[::-1][:5]:
        print('{}: {}%'.format(label_to_name[i], round(prob[i]*100, 2)))

    return display(Image(open('animation.gif', 'rb').read()))
Model prediction result:
File name: hand_gesture/woman_014_0_7.mp4
Ground-truth class: Invalid gesture
Predicted classes:
Invalid gesture: 99.82%
Swipe down: 0.12%
Close: 0.04%
Swipe left: 0.01%
Open: 0.01%