PyTorch DDP fails when a submodule's parameters are used directly (outside forward()) to compute the loss.
These are my scripts:
# train.py:
class Model(nn.Module):
    def __init__(self, params):
        ...
        self.xnli_proj = nn.Linear(dim, 3)
        ...

model = Model(params)
output = model.xnli_proj(encoder_output)
I launch it with:
python -m torch.distributed.launch --nproc_per_node=8 train.py
The error I got:
Traceback (most recent call last):
  File "./UVL/train_nlp.py", line 440, in <module>
    main(params)
  File "./UVL/train_nlp.py", line 401, in main
    trainer.xnli_step(lang1, lang2, params.lambda_xnli)
  File "/disk/qiaolin/UVL/src/xnli.py", line 902, in xnli_step
    output = self.model.xnli_proj(encoder_output)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 585, in __getattr__
    type(self).__name__, name))
AttributeError: 'DistributedDataParallel' object has no attribute 'xnli_proj'
This is the solution I found:
https://www.gitmemory.com/issue/NVIDIA/apex/436/529107546
In short: DDP wraps the model, so submodules like xnli_proj are no longer attributes of the wrapper. Route every computation through the wrapped model's forward() so DDP knows which parameters participate and can synchronize their gradients. (You can also reach the underlying model via model.module, but direct calls on .module bypass DDP's gradient-sync hooks.)
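Here is a minimal runnable sketch of that fix. The Model class, dimensions, and the "mode" argument are hypothetical stand-ins for the real train.py code; the process group uses a single-process gloo backend so the example runs on CPU:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist

class Model(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)
        self.xnli_proj = nn.Linear(dim, 3)

    # Route the projection through forward() so the DDP wrapper
    # sees it and can register gradient hooks on its parameters.
    def forward(self, x, mode="xnli"):
        h = self.encoder(x)
        if mode == "xnli":
            return self.xnli_proj(h)
        return h

# Single-process gloo group so the sketch runs without GPUs.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.parallel.DistributedDataParallel(Model())

x = torch.randn(4, 16)

# Correct: goes through forward(), so DDP syncs gradients.
out = model(x, mode="xnli")
print(out.shape)  # torch.Size([4, 3])

# Also possible, but skips DDP's gradient hooks for this call:
out2 = model.module.xnli_proj(model.module.encoder(x))
print(out2.shape)

dist.destroy_process_group()
```

With 8 processes as in the launch command above, the same pattern applies; only the forward() path is safe for training.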