我在樹莓派上跑通了bert模型，使用numpy實現bert模型，使用hugging face 或pytorch訓練模型，保存參數爲numpy格式，然後使用numpy加載模型推理

之前分別用numpy實現了mlp，cnn，lstm，這次搞一個大一點的模型bert，純numpy實現，最重要的是可在樹莓派上或其他不能安裝pytorch的板子上運行，推理數據

本次模型是隨便在hugging face上找的一個新聞評論的模型，7分類

看這些模型參數，這並不重要，模型佔硬盤空間都要400+M

bert.embeddings.word_embeddings.weight torch.Size([21128, 768])
bert.embeddings.position_embeddings.weight torch.Size([512, 768])
bert.embeddings.token_type_embeddings.weight torch.Size([2, 768])
bert.embeddings.LayerNorm.weight torch.Size([768])
bert.embeddings.LayerNorm.bias torch.Size([768])
bert.encoder.layer.0.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.self.query.bias torch.Size([768])
bert.encoder.layer.0.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.self.key.bias torch.Size([768])
bert.encoder.layer.0.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.self.value.bias torch.Size([768])
bert.encoder.layer.0.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.0.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.0.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.0.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.0.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.0.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.0.output.dense.bias torch.Size([768])
bert.encoder.layer.0.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.0.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.1.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.1.attention.self.query.bias torch.Size([768])
bert.encoder.layer.1.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.1.attention.self.key.bias torch.Size([768])
bert.encoder.layer.1.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.1.attention.self.value.bias torch.Size([768])
bert.encoder.layer.1.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.1.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.1.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.1.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.1.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.1.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.1.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.1.output.dense.bias torch.Size([768])
bert.encoder.layer.1.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.1.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.2.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.2.attention.self.query.bias torch.Size([768])
bert.encoder.layer.2.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.2.attention.self.key.bias torch.Size([768])
bert.encoder.layer.2.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.2.attention.self.value.bias torch.Size([768])
bert.encoder.layer.2.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.2.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.2.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.2.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.2.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.2.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.2.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.2.output.dense.bias torch.Size([768])
bert.encoder.layer.2.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.2.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.3.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.3.attention.self.query.bias torch.Size([768])
bert.encoder.layer.3.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.3.attention.self.key.bias torch.Size([768])
bert.encoder.layer.3.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.3.attention.self.value.bias torch.Size([768])
bert.encoder.layer.3.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.3.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.3.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.3.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.3.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.3.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.3.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.3.output.dense.bias torch.Size([768])
bert.encoder.layer.3.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.3.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.4.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.4.attention.self.query.bias torch.Size([768])
bert.encoder.layer.4.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.4.attention.self.key.bias torch.Size([768])
bert.encoder.layer.4.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.4.attention.self.value.bias torch.Size([768])
bert.encoder.layer.4.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.4.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.4.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.4.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.4.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.4.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.4.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.4.output.dense.bias torch.Size([768])
bert.encoder.layer.4.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.4.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.5.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.5.attention.self.query.bias torch.Size([768])
bert.encoder.layer.5.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.5.attention.self.key.bias torch.Size([768])
bert.encoder.layer.5.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.5.attention.self.value.bias torch.Size([768])
bert.encoder.layer.5.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.5.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.5.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.5.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.5.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.5.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.5.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.5.output.dense.bias torch.Size([768])
bert.encoder.layer.5.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.5.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.6.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.6.attention.self.query.bias torch.Size([768])
bert.encoder.layer.6.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.6.attention.self.key.bias torch.Size([768])
bert.encoder.layer.6.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.6.attention.self.value.bias torch.Size([768])
bert.encoder.layer.6.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.6.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.6.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.6.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.6.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.6.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.6.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.6.output.dense.bias torch.Size([768])
bert.encoder.layer.6.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.6.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.7.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.7.attention.self.query.bias torch.Size([768])
bert.encoder.layer.7.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.7.attention.self.key.bias torch.Size([768])
bert.encoder.layer.7.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.7.attention.self.value.bias torch.Size([768])
bert.encoder.layer.7.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.7.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.7.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.7.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.7.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.7.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.7.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.7.output.dense.bias torch.Size([768])
bert.encoder.layer.7.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.7.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.8.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.8.attention.self.query.bias torch.Size([768])
bert.encoder.layer.8.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.8.attention.self.key.bias torch.Size([768])
bert.encoder.layer.8.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.8.attention.self.value.bias torch.Size([768])
bert.encoder.layer.8.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.8.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.8.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.8.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.8.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.8.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.8.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.8.output.dense.bias torch.Size([768])
bert.encoder.layer.8.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.8.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.9.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.9.attention.self.query.bias torch.Size([768])
bert.encoder.layer.9.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.9.attention.self.key.bias torch.Size([768])
bert.encoder.layer.9.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.9.attention.self.value.bias torch.Size([768])
bert.encoder.layer.9.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.9.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.9.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.9.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.9.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.9.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.9.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.9.output.dense.bias torch.Size([768])
bert.encoder.layer.9.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.9.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.10.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.10.attention.self.query.bias torch.Size([768])
bert.encoder.layer.10.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.10.attention.self.key.bias torch.Size([768])
bert.encoder.layer.10.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.10.attention.self.value.bias torch.Size([768])
bert.encoder.layer.10.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.10.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.10.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.10.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.10.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.10.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.10.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.10.output.dense.bias torch.Size([768])
bert.encoder.layer.10.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.10.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.11.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.11.attention.self.query.bias torch.Size([768])
bert.encoder.layer.11.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.11.attention.self.key.bias torch.Size([768])
bert.encoder.layer.11.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.11.attention.self.value.bias torch.Size([768])
bert.encoder.layer.11.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.11.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.11.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.11.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.11.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.11.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.11.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.11.output.dense.bias torch.Size([768])
bert.encoder.layer.11.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.11.output.LayerNorm.bias torch.Size([768])
bert.pooler.dense.weight torch.Size([768, 768])
bert.pooler.dense.bias torch.Size([768])
classifier.weight torch.Size([7, 768])
classifier.bias torch.Size([7])

爲了實現numpy的bert模型，踩了兩天的坑，一步步對比huggingface源碼實現的，真的太難了～～～

這是使用numpy實現的bert代碼，分數上和huggingface有稍微的一點點區別，可能是模型太大，保存的模型參數誤差累計造成的！

看下面的代碼真的有利於直接瞭解bert模型結構，各種細節簡單又到位，自己都服自己，研究這個東西～～～

import numpy as np

def word_embedding(input_ids, word_embeddings):
    return word_embeddings[input_ids]

def position_embedding(position_ids, position_embeddings):
    return position_embeddings[position_ids]

def token_type_embedding(token_type_ids, token_type_embeddings):
    return token_type_embeddings[token_type_ids]

def softmax(x, axis=None):
    # e_x = np.exp(x).astype(np.float32) #  
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    sum_ex = np.sum(e_x, axis=axis,keepdims=True).astype(np.float32)
    return e_x / sum_ex

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, np.full_like(scores, -np.inf))
    attention_weights = softmax(scores, axis=-1)
    # print(attention_weights)
    # print(np.sum(attention_weights,axis=-1))
    output = np.matmul(attention_weights, V)
    return output, attention_weights

def multihead_attention(input, num_heads,W_Q,B_Q,W_K,B_K,W_V,B_V,W_O,B_O):

    q = np.matmul(input, W_Q.T)+B_Q
    k = np.matmul(input, W_K.T)+B_K
    v = np.matmul(input, W_V.T)+B_V

    # 分割輸入爲多個頭
    q = np.split(q, num_heads, axis=-1)
    k = np.split(k, num_heads, axis=-1)
    v = np.split(v, num_heads, axis=-1)

    outputs = []
    for q_,k_,v_ in zip(q,k,v):
        output, attention_weights = scaled_dot_product_attention(q_, k_, v_)
        outputs.append(output)
    outputs = np.concatenate(outputs, axis=-1)
    outputs = np.matmul(outputs, W_O.T)+B_O
    return outputs

def layer_normalization(x, weight, bias, eps=1e-12):
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    std = np.sqrt(variance + eps)
    normalized_x = (x - mean) / std
    output = weight * normalized_x + bias
    return output

def feed_forward_layer(inputs, weight, bias, activation='relu'):
    linear_output = np.matmul(inputs,weight) + bias
    
    if activation == 'relu':
        activated_output = np.maximum(0, linear_output)  # ReLU激活函數
    elif activation == 'gelu':
        activated_output = 0.5 * linear_output * (1 + np.tanh(np.sqrt(2 / np.pi) * (linear_output + 0.044715 * np.power(linear_output, 3))))  # GELU激活函數
    
    elif activation == "tanh" :
        activated_output = np.tanh(linear_output)
    else:
        activated_output = linear_output  # 無激活函數
    
    return activated_output


def residual_connection(inputs, residual):
    # 殘差連接
    residual_output = inputs + residual
    return residual_output


def tokenize_sentence(sentence, vocab_file = 'vocab.txt'):
    with open(vocab_file, 'r', encoding='utf-8') as f:
        vocab = f.readlines()
        vocab = [i.strip() for i in vocab]
        # print(len(vocab))

    tokenized_sentence = ['[CLS]'] + list(sentence) + ["[SEP]"] # 在句子開頭添加[cls]
    token_ids = [vocab.index(token) for token in tokenized_sentence]

    return token_ids

# 加載保存的模型數據
model_data = np.load('bert_model_params.npz')

word_embeddings = model_data["bert.embeddings.word_embeddings.weight"]
position_embeddings = model_data["bert.embeddings.position_embeddings.weight"]
token_type_embeddings = model_data["bert.embeddings.token_type_embeddings.weight"]
def model_input(sentence):
    token_ids = tokenize_sentence(sentence)
    input_ids = np.array(token_ids)  # 輸入的詞彙id
    word_embedded = word_embedding(input_ids, word_embeddings)

    position_ids = np.array(range(len(input_ids)))  # 位置id
    # 位置嵌入矩陣，形狀爲 (max_position, embedding_size)
    position_embedded = position_embedding(position_ids, position_embeddings)

    token_type_ids = np.array([0]*len(input_ids))  # 片段類型id
    # 片段類型嵌入矩陣，形狀爲 (num_token_types, embedding_size)
    token_type_embedded = token_type_embedding(token_type_ids, token_type_embeddings)

    embedding_output = np.expand_dims(word_embedded + position_embedded + token_type_embedded, axis=0)


    return embedding_output


def bert(input,num_heads):

    ebd_LayerNorm_weight = model_data['bert.embeddings.LayerNorm.weight']
    ebd_LayerNorm_bias = model_data['bert.embeddings.LayerNorm.bias']
    input = layer_normalization(input,ebd_LayerNorm_weight,ebd_LayerNorm_bias)     #這裏和模型輸出一致

    for i in range(12):
        # 調用多頭自注意力函數
        W_Q = model_data['bert.encoder.layer.{}.attention.self.query.weight'.format(i)]
        B_Q = model_data['bert.encoder.layer.{}.attention.self.query.bias'.format(i)]
        W_K = model_data['bert.encoder.layer.{}.attention.self.key.weight'.format(i)]
        B_K = model_data['bert.encoder.layer.{}.attention.self.key.bias'.format(i)]
        W_V = model_data['bert.encoder.layer.{}.attention.self.value.weight'.format(i)]
        B_V = model_data['bert.encoder.layer.{}.attention.self.value.bias'.format(i)]
        W_O = model_data['bert.encoder.layer.{}.attention.output.dense.weight'.format(i)]
        B_O = model_data['bert.encoder.layer.{}.attention.output.dense.bias'.format(i)]
        attention_output_LayerNorm_weight = model_data['bert.encoder.layer.{}.attention.output.LayerNorm.weight'.format(i)]
        attention_output_LayerNorm_bias = model_data['bert.encoder.layer.{}.attention.output.LayerNorm.bias'.format(i)]
        intermediate_weight = model_data['bert.encoder.layer.{}.intermediate.dense.weight'.format(i)]
        intermediate_bias = model_data['bert.encoder.layer.{}.intermediate.dense.bias'.format(i)]
        dense_weight = model_data['bert.encoder.layer.{}.output.dense.weight'.format(i)]
        dense_bias = model_data['bert.encoder.layer.{}.output.dense.bias'.format(i)]
        output_LayerNorm_weight = model_data['bert.encoder.layer.{}.output.LayerNorm.weight'.format(i)]
        output_LayerNorm_bias = model_data['bert.encoder.layer.{}.output.LayerNorm.bias'.format(i)]

        output = multihead_attention(input, num_heads,W_Q,B_Q,W_K,B_K,W_V,B_V,W_O,B_O)
        output = residual_connection(input,output)
        output1 = layer_normalization(output,attention_output_LayerNorm_weight,attention_output_LayerNorm_bias)    #這裏和模型輸出一致

        output = feed_forward_layer(output1, intermediate_weight.T, intermediate_bias, activation='gelu')
        output = feed_forward_layer(output, dense_weight.T, dense_bias, activation='')
        output = residual_connection(output1,output)
        output2 = layer_normalization(output,output_LayerNorm_weight,output_LayerNorm_bias)   #一致

        
        input = output2

    bert_pooler_dense_weight = model_data['bert.pooler.dense.weight']
    bert_pooler_dense_bias = model_data['bert.pooler.dense.bias']
    output = feed_forward_layer(output2, bert_pooler_dense_weight.T, bert_pooler_dense_bias, activation='tanh')    #一致
    return output


# for i in model_data:
#     # print(i)
#     print(i,model_data[i].shape)
id2label = {0: 'mainland China politics', 1: 'Hong Kong - Macau politics', 2: 'International news', 3: 'financial news', 4: 'culture', 5: 'entertainment', 6: 'sports'}
classifier_weight = model_data['classifier.weight']
classifier_bias = model_data['classifier.bias']
if __name__ == "__main__":

    sentences = ["馬拉松比賽","香港有羣衆遊行示威","黨中央決定製定愛國教育法","俄羅斯和歐美對抗","人民幣匯率貶值","端午節喫糉子","大媽們跳廣場舞"]
    while True:
        # 示例用法
        for sentence in sentences:
            # print(model_input(sentence).shape)
            output = bert(model_input(sentence),num_heads=12)
            # print(output)
            output = feed_forward_layer(output[:,0,:], classifier_weight.T, classifier_bias, activation='')
            # print(output)
            output = softmax(output,axis=-1)
            label_id = np.argmax(output,axis=-1)
            label_score = output[0][label_id]
            print("sentence:",sentence,"\tlabels:",id2label[label_id[0]],"\tscore:",label_score)

這是hugging face上找的一個別人訓練好的模型，roberta模型作新聞7分類，並且保存模型結構爲numpy格式，爲了上面的代碼加載

import numpy as np
from transformers import AutoModelForSequenceClassification,AutoTokenizer,pipeline
model = AutoModelForSequenceClassification.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
text_classification = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
print(text_classification("馬拉松決賽"))

# print(model)


# 打印BERT模型的權重維度
for name, param in model.named_parameters():
    print(name, param.data.shape)

# # # 保存模型參數爲NumPy格式
model_params = {name: param.data.cpu().numpy() for name, param in model.named_parameters()}
np.savez('bert_model_params.npz', **model_params)
# model_params

對比兩個結果：

hugging face：[{'label': 'sports', 'score': 0.9929242134094238}]

numpy：sports [0.9928773]

我在樹莓派上跑通了bert模型，使用numpy實現bert模型，使用hugging face 或pytorch訓練模型，保存參數爲numpy格式，然後使用numpy加載模型推理

【面試準備】又一次失敗的面試經歷，題目離譜～資深軟件測試工程師

deepspeed 訓練多機多卡報錯 ncclSystemError Last error

如何實現圖像搜索，文搜圖，圖搜圖，CLIP+faiss向量數據庫實現圖像高效搜索

使用單卡qlora混合精度訓練大模型chatGLM2-6b，解決qlora loss變成nan的問題！

我用numpy實現了VIT，手寫vision transformer, 可在樹莓派上運行，在hugging face上訓練模型保存參數成numpy格式，純numpy實現

我用numpy實現了GPT-2，GPT-2源碼，GPT-2模型加速推理，並且可以在樹莓派上運行，讀了不少hungging face源碼，手動實現了numpy的GPT2模型

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結