Notes on problems encountered while using HRNet

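Background: I have been trying out HRNet recently, so here is a brief record of the problems I ran into and how I solved them. I may keep updating this.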

Training on the COCO dataset

Following the official tutorial, I ran `python tools/train.py --cfg experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml` to start training and got the following error:

Traceback (most recent call last):
  File "tools/train.py", line 223, in <module>
    main()
  File "tools/train.py", line 111, in main
    writer_dict['writer'].add_graph(model, (dump_input, ))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorboardX/writer.py", line 566, in add_graph
    self.file_writer.add_graph(graph(model, input_to_model, verbose))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 240, in graph
    list_of_nodes, node_stats = parse(graph, args, omit_useless_nodes)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 161, in parse
    nodes_py.append(Node_py_OP(node))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 74, in __init__
    super(Node_py_OP, self).__init__(Node_cpp, methods_OP)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 54, in __init__
    io_tensorSize_list.append(n.type().sizes())
RuntimeError: r ASSERT FAILED at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/ATen/core/jit_type.h:142, please report a bug to PyTorch. (expect at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/ATen/core/jit_type.h:142)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f8ff847edc5 in /root/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: std::shared_ptr<c10::CompleteTensorType> c10::Type::expect<c10::CompleteTensorType>() + 0xa3 (0x7f9027766663 in /root/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x4211bb (0x7f902777c1bb in /root/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x12ce4a (0x7f9027487e4a in /root/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #43: __libc_start_main + 0xf0 (0x7f902df02830 in /lib/x86_64-linux-gnu/libc.so.6)

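At first I searched for the keyword "Error::Error(c10::SourceLocation, std::string const&) + 0x45" and found a post about an mmdetection training error. Its author said that "in a COCO-format annotations.json, the category_id in categories must not be 0 (i.e. the background class)". That sounded plausible: I had downloaded the original COCO dataset and had not cleaned it, so I wrote a small script to remove every annotation with category_id=0.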
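A minimal sketch of such a cleanup, assuming the usual COCO layout with a top-level `annotations` list whose entries carry a `category_id`. The function names `drop_category` and `clean_file` are hypothetical, and this is not the exact script that was used here.

```python
import json


def drop_category(coco, category_id=0):
    """Return a copy of a COCO-style dict without annotations of `category_id`."""
    cleaned = dict(coco)  # shallow copy; the input dict itself is not modified
    cleaned["annotations"] = [
        ann for ann in coco.get("annotations", [])
        if ann.get("category_id") != category_id
    ]
    return cleaned


def clean_file(src_path, dst_path, category_id=0):
    """Read a COCO annotation file, filter it, and write the result.

    Returns the number of annotations that were removed.
    """
    with open(src_path, "r", encoding="utf-8") as f:
        coco = json.load(f)
    cleaned = drop_category(coco, category_id)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(cleaned, f)
    return len(coco.get("annotations", [])) - len(cleaned["annotations"])
```

Writing to a separate output path keeps the original annotation file intact, which matters here because this cleanup turned out not to be the cause of the error.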

I ran the command again, and the same error appeared again.

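So I searched with a different keyword, "ASSERT FAILED at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/ATen/core/jit_type.h:142", and still found nothing, so I had to get past the firewall and search from there. This time I found the PyTorch issues, and following the advice given there, I replaced `from tensorboardX import SummaryWriter` in train.py with `from torch.utils.tensorboard import SummaryWriter`.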
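If you would rather not hard-code the import, the swap can be sketched as a small helper that prefers `torch.utils.tensorboard` and only then tries `tensorboardX`. This is an illustration, not part of the HRNet code: `load_summary_writer` is a hypothetical name, and if it ends up falling back to `tensorboardX`, the `add_graph` error shown earlier may well come back.

```python
import importlib


def load_summary_writer(candidates=("torch.utils.tensorboard", "tensorboardX")):
    """Return (module_name, SummaryWriter class) from the first usable candidate.

    Candidates are tried in order; modules that fail to import or that do not
    expose a SummaryWriter attribute are skipped.
    """
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too, since it subclasses ImportError.
            continue
        writer_cls = getattr(module, "SummaryWriter", None)
        if writer_cls is not None:
            return name, writer_cls
    raise ImportError("No SummaryWriter found in: " + ", ".join(candidates))
```

In train.py this would be used as `_, SummaryWriter = load_summary_writer()` in place of the direct import line.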

That fixed the bug above, only for a new one to take its place:

Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/tensorboard/__init__.py", line 2, in <module>
    from tensorboard.summary.writer.record_writer import RecordWriter  # noqa F401
ModuleNotFoundError: No module named 'tensorboard'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 24, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
    raise ImportError('TensorBoard logging requires TensorBoard with Python summary writer installed. '
ImportError: TensorBoard logging requires TensorBoard with Python summary writer installed. This should be available in 1.14 or above.

The issue recommends `pip install tb-nightly`, but by now a plain `pip install tensorboard` is enough. Afterwards, run `import torch.utils.tensorboard` to confirm that the installation worked.
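As a quick pre-flight check before launching training, something like the following can report which of the required packages cannot be found. It is only a sketch: `missing_packages` is a hypothetical helper, it only looks at top-level package names, and it does not verify versions, so the real test is still the import itself.

```python
import importlib.util


def missing_packages(names=("torch", "tensorboard")):
    """Return the subset of top-level package names that cannot be located."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("torch and tensorboard were both found")
```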
