Things I'm Not Very Sure About

When using multiple losses, for example knowledge distillation combined with softmax classification, how are the two gradients integrated? In the model-parallel softmax model, the gradient produced by the classification loss (i.e. out_grad) is passed to the executor_manager explicitly. What happens to the gradient produced by the knowledge distillation part?
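
My current understanding, sketched below, is that because the total loss is a (weighted) sum of the two losses, the gradient w.r.t. the shared network output is just the elementwise sum of the two per-loss gradients, so the distillation gradient can be added to the classification out_grad before it is handed to the executor. This is a minimal sketch, not the project's actual code; the distillation loss is assumed to be a simple L2 term against a teacher embedding, and names like `cls_out_grad`, `kd_weight`, and `teacher_embedding` are illustrative.

```python
import mxnet as mx


def combined_out_grad(embedding, teacher_embedding, cls_out_grad, kd_weight=0.5):
    """Return d(L_total)/d(embedding) for L_total = L_cls + kd_weight * L_kd.

    Assumes L_kd = 0.5 * ||embedding - teacher_embedding||^2, whose gradient
    w.r.t. the embedding is simply (embedding - teacher_embedding).
    """
    kd_grad = embedding - teacher_embedding
    # Gradients of a sum of losses add, so combine them elementwise.
    return cls_out_grad + kd_weight * kd_grad


# Usage with a plain Executor (the DataParallelExecutorManager path should be
# analogous: one combined out_grad per output on each device):
#   executor.forward(is_train=True)
#   embedding = executor.outputs[0]
#   out_grad = combined_out_grad(embedding, teacher_embedding, cls_out_grad)
#   executor.backward(out_grads=[out_grad])

if __name__ == "__main__":
    # Tiny smoke test with dummy 2x4 embeddings (purely illustrative).
    emb = mx.nd.random.normal(shape=(2, 4))
    teacher = mx.nd.random.normal(shape=(2, 4))
    cls_grad = mx.nd.random.normal(shape=(2, 4))
    print(combined_out_grad(emb, teacher, cls_grad))
```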

  • [ ] Dive deeper into the DataParallelExecutorManager and Executor classes.
