Things Not Very Sure About
When using multiple losses, for example knowledge distillation combined with softmax classification, how are the two gradients integrated? In the model-parallel softmax model, the gradient from the classification loss, i.e. `out_grad`, is passed to the `executor_manager` explicitly. How are the gradients generated by the knowledge-distillation part handled?
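As a conceptual sketch (plain NumPy, not MXNet's actual API; `kd_weight` and the teacher distribution are hypothetical), the usual resolution is that both losses share the same output logits, so the `out_grad` handed to the backward pass is simply the weighted sum of the per-loss gradients:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical student logits for one example.
logits = np.array([2.0, 0.5, -1.0])
teacher_probs = np.array([0.6, 0.3, 0.1])  # soft targets from a teacher model
label = 0                                  # hard classification label
kd_weight = 0.5                            # hypothetical loss-mixing weight

p = softmax(logits)

# Cross-entropy with the hard label: gradient w.r.t. logits is p - one_hot(label).
one_hot = np.zeros_like(p)
one_hot[label] = 1.0
grad_cls = p - one_hot

# Cross-entropy with the teacher's soft targets: gradient is p - teacher_probs.
grad_kd = p - teacher_probs

# Combined gradient on the shared logits: gradients from the two losses
# are accumulated (summed), weighted by the loss-mixing coefficient.
out_grad = grad_cls + kd_weight * grad_kd
print(out_grad)
```

If this picture is right, the distillation gradient need not take a separate path through the executor machinery: it is folded into the single `out_grad` on the shared output before backpropagation.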
- [ ] Dive deeper into the `DataParallelExecutorManager` and `Executor` classes.