Things Not Very Sure About
When using multiple losses, for example knowledge distillation combined with softmax classification, how are the two gradients integrated? In the model-parallel softmax model, the gradient from the classification loss, i.e. `out_grad`, is passed to the `executor_manager` explicitly. How are the gradients generated by the knowledge-distillation part handled?
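As a conceptual sketch (plain NumPy, not MXNet's actual API; `kd_weight` and the teacher distribution are hypothetical), the usual resolution is that both losses share the same output logits, so the `out_grad` handed to the backward pass is simply the weighted sum of the per-loss gradients:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical student logits for one example.
logits = np.array([2.0, 0.5, -1.0])
teacher_probs = np.array([0.6, 0.3, 0.1])  # soft targets from a teacher model
label = 0                                  # hard classification label
kd_weight = 0.5                            # hypothetical loss-mixing weight

p = softmax(logits)

# Cross-entropy with the hard label: gradient w.r.t. logits is p - one_hot(label).
one_hot = np.zeros_like(p)
one_hot[label] = 1.0
grad_cls = p - one_hot

# Cross-entropy with the teacher's soft targets: gradient is p - teacher_probs.
grad_kd = p - teacher_probs

# Combined gradient on the shared logits: gradients from the two losses
# are accumulated (summed), weighted by the loss-mixing coefficient.
out_grad = grad_cls + kd_weight * grad_kd
print(out_grad)
```

If this picture is right, the distillation gradient need not take a separate path through the executor machinery: it is folded into the single `out_grad` on the shared output before backpropagation.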
- [ ] Dive deeper into the `DataParallelExecutorManager` and `Executor` classes.