Large Model Architecture: MoE

modeling_mistral.py in the transformers library:


MistralModel(
  (embed_tokens): Embedding(32000, 4096)
  (layers): ModuleList(
    (0-1): 2 x MistralDecoderLayer(
      (self_attn): MistralSdpaAttention(
        (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
        (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
        (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (rotary_emb): MistralRotaryEmbedding()
      )
      (mlp): MistralMLP(
        (gate_proj): Linear(in_features=4096, out_features=2, bias=False)
        (up_proj): Linear(in_features=4096, out_features=2, bias=False)
        (down_proj): Linear(in_features=2, out_features=4096, bias=False)
        (act_fn): SiLU()
      )
      (input_layernorm): MistralRMSNorm()
      (post_attention_layernorm): MistralRMSNorm()
    )
  )
  (norm): MistralRMSNorm()
)
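A couple of notes on the printout above. The k_proj/v_proj layers map 4096 -> 1024 because Mistral uses grouped-query attention (8 key/value heads against 32 query heads, each of dimension 128). The MistralMLP computes a SwiGLU-style feed-forward: down_proj(SiLU(gate_proj(x)) * up_proj(x)). Below is a minimal standalone sketch of that forward pass; the class name TinyMistralMLP is just for illustration, but the projection pattern matches the modules shown in the printout (intermediate_size=2 comes from the toy config used later).

import torch
import torch.nn as nn

# Minimal sketch of the MistralMLP shown above, assuming the standard SwiGLU
# pattern: down_proj(SiLU(gate_proj(x)) * up_proj(x)).
class TinyMistralMLP(nn.Module):
    def __init__(self, hidden_size=4096, intermediate_size=2):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x):
        # Gate branch is activated with SiLU, multiplied elementwise with the
        # up branch, then projected back to hidden_size.
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

# Shape check: (batch, seq_len, hidden_size) in, same shape out
x = torch.randn(1, 3, 4096)
print(TinyMistralMLP()(x).shape)  # torch.Size([1, 3, 4096])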

Debug code:

import torch
import transformers

# Build a tiny 2-layer Mistral with intermediate_size=2 so the structure
# printed above is easy to read
config = transformers.MistralConfig(num_hidden_layers=2, intermediate_size=2)
model = transformers.MistralModel(config)
print(model)  # prints the MistralModel structure shown above

# Dummy forward pass: a batch of one sequence with three token ids
input_ids = torch.tensor([1, 2, 4]).unsqueeze(0)
outputs = model(input_ids)
print(outputs)  # BaseModelOutputWithPast; last_hidden_state has shape (1, 3, 4096)
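The model above is the dense Mistral architecture; the MoE variant lives in modeling_mixtral.py, where the MistralMLP slot in each decoder layer is replaced by a sparse MoE block (a router plus several expert MLPs, with top-k routing per token). A minimal way to inspect it, assuming transformers' MixtralConfig / MixtralModel and their num_local_experts / num_experts_per_tok fields; the tiny sizes are assumptions purely for a readable printout.

import transformers

# Tiny Mixtral for inspection: 2 layers, 4 experts per layer, top-2 routing
moe_config = transformers.MixtralConfig(
    num_hidden_layers=2,
    intermediate_size=2,
    num_local_experts=4,     # number of expert MLPs in each layer
    num_experts_per_tok=2,   # router keeps the top-2 experts per token
)
moe_model = transformers.MixtralModel(moe_config)
print(moe_model)  # each layer's MLP slot is now a sparse MoE block with a gate and 4 experts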