Notes on Problems Encountered While Using HRNet


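Background: I have recently been trying out HRNet, and this is a brief record of the problems I ran into and how I solved them. I may keep updating it.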

Training on the COCO dataset

Following the official tutorial, I ran "python tools/train.py --cfg experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml" to start training and got the following error:

Traceback (most recent call last):
  File "tools/train.py", line 223, in <module>
    main()
  File "tools/train.py", line 111, in main
    writer_dict['writer'].add_graph(model, (dump_input, ))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorboardX/writer.py", line 566, in add_graph
    self.file_writer.add_graph(graph(model, input_to_model, verbose))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 240, in graph
    list_of_nodes, node_stats = parse(graph, args, omit_useless_nodes)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 161, in parse
    nodes_py.append(Node_py_OP(node))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 74, in __init__
    super(Node_py_OP, self).__init__(Node_cpp, methods_OP)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 54, in __init__
    io_tensorSize_list.append(n.type().sizes())
RuntimeError: r ASSERT FAILED at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/ATen/core/jit_type.h:142, please report a bug to PyTorch. (expect at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/ATen/core/jit_type.h:142)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f8ff847edc5 in /root/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: std::shared_ptr<c10::CompleteTensorType> c10::Type::expect<c10::CompleteTensorType>() + 0xa3 (0x7f9027766663 in /root/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x4211bb (0x7f902777c1bb in /root/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x12ce4a (0x7f9027487e4a in /root/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #43: __libc_start_main + 0xf0 (0x7f902df02830 in /lib/x86_64-linux-gnu/libc.so.6)

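At first I searched for the keyword "Error::Error(c10::SourceLocation, std::string const&) + 0x45" and found a post about an mmdetection training error. Its author said that "in a COCO-format annotations.json, the category_id values in categories must not include 0 (i.e. the background class)". That sounded plausible: I had downloaded the original COCO dataset and had not cleaned it, so I wrote a small script to remove the annotations with category_id=0.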
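The cleanup step described above could look roughly like the sketch below. It is not the original script, which the post does not show; the function names and the file-handling helper are hypothetical, and it assumes the usual COCO layout with a top-level "annotations" list whose entries carry a "category_id" field.

```python
import json


def drop_category_zero(coco):
    """Return a shallow copy of a COCO-style dict without category_id == 0 annotations."""
    cleaned = dict(coco)  # the input dict itself is left untouched
    cleaned["annotations"] = [
        ann for ann in coco.get("annotations", [])
        if ann.get("category_id") != 0
    ]
    return cleaned


def clean_annotation_file(src_path, dst_path):
    """Hypothetical helper: read src_path, filter it, and write the result to dst_path."""
    with open(src_path, "r", encoding="utf-8") as f:
        coco = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(drop_category_zero(coco), f)
```

Writing the result to a separate file keeps the downloaded annotations intact, which makes it easy to undo the cleanup if it turns out not to be related to the error.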

I ran the command again, and the same error appeared again.

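So I switched to a different keyword, "ASSERT FAILED at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/ATen/core/jit_type.h:142", and still found nothing, so I had to search through a VPN. This time I found the PyTorch issues, and following the advice there I replaced from tensorboardX import SummaryWriter in train.py with from torch.utils.tensorboard import SummaryWriter.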
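As an illustration of that import swap, here is a minimal sketch that prefers torch.utils.tensorboard and only falls back to tensorboardX when the former cannot be imported. The helper name load_summary_writer is hypothetical and is not part of HRNet's train.py; in the actual fix a one-line change of the import is enough.

```python
import importlib


def load_summary_writer(candidates=("torch.utils.tensorboard", "tensorboardX")):
    """Return (SummaryWriter class, backend name) for the first importable backend.

    Returns (None, None) if none of the candidate modules can be imported.
    """
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # backend not installed, try the next one
        writer_cls = getattr(module, "SummaryWriter", None)
        if writer_cls is not None:
            return writer_cls, name
    return None, None


if __name__ == "__main__":
    SummaryWriter, backend = load_summary_writer()
    print("SummaryWriter backend:", backend)
```

Printing the backend name matters here: if the helper silently falls back to tensorboardX, the add_graph call may fail in the same way as before, so it is worth checking which module was actually picked up.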

That fixed the bug above, but a new one took its place:

Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/tensorboard/__init__.py", line 2, in <module>
    from tensorboard.summary.writer.record_writer import RecordWriter  # noqa F401
ModuleNotFoundError: No module named 'tensorboard'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 24, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
    raise ImportError('TensorBoard logging requires TensorBoard with Python summary writer installed. '
ImportError: TensorBoard logging requires TensorBoard with Python summary writer installed. This should be available in 1.14 or above.

The issue recommends pip install tb-nightly, but by now a plain pip install tensorboard works as well. Afterwards, running import torch.utils.tensorboard shows whether the installation succeeded.
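The final import check can also be scripted, for example before launching a long training run. The sketch below is only an assumption about how one might do it; check_import is a hypothetical helper, and it relies solely on the standard importlib module.

```python
import importlib


def check_import(module_name="torch.utils.tensorboard"):
    """Try to import module_name and return (ok, detail).

    detail is the module name on success, or the error message on failure.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return False, str(exc)
    return True, module.__name__


if __name__ == "__main__":
    ok, detail = check_import()
    print("import succeeded:" if ok else "import failed:", detail)
```

If the check fails with the "TensorBoard logging requires TensorBoard" message shown above, the tensorboard package is still missing from the active Python environment.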
