mirror of
https://github.com/modelscope/modelscope.git
synced 2025-12-20 18:19:21 +01:00
Design doc: https://yuque.alibaba-inc.com/suluyan.sly/yh1rvu/yx0owblyebpa2b3l?singleDoc#flU3s

1. Add a parallel_group field to the trainer to support DP, TP, and PP.
2. Move the construction of common hooks (except the optimizer/lr scheduler hooks) to the trainer's init method to support the after_init stage. The after_init stage is where DP, TP, and PP are initialized.
   https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48099986
3. Add before_eval/after_eval stages to support model wrapping. This solves the ordering problem between apex amp and DDP wrapping.
   https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48099986
4. Exporter supports lazy importing.
   https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48122780
5. Fold all megatron imports into the megatron hook.
   https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48099986
6. Add a compile method to TorchModel, Pipeline, and Trainer to support torch 2.0.
   https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=46869415
7. Fix bug: the lr scheduler builder does not support torch 2.0.
8. Add callbacks for the trainer.
   https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48210342

Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11849932
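
For illustration only, here is a minimal sketch of how hook stages such as after_init, before_eval, and after_eval could be dispatched. It is not the modelscope implementation: MiniTrainer, ParallelHook, WrapHook, the invoke method, and the parallel_groups dict are all hypothetical names. The sketch assumes the common hooks are built before the trainer's init returns, so that after_init can run there.

```python
class Hook:
    """Base hook: every stage is a no-op unless a subclass overrides it."""

    def after_init(self, trainer):
        pass

    def before_eval(self, trainer):
        pass

    def after_eval(self, trainer):
        pass


class ParallelHook(Hook):
    """Hypothetical hook that sets up parallel groups once the trainer exists."""

    def after_init(self, trainer):
        # Placeholder sizes; a real setup would create process groups here.
        trainer.parallel_groups = {"DP": 1, "TP": 1, "PP": 1}


class WrapHook(Hook):
    """Hypothetical hook that wraps the model around evaluation."""

    def before_eval(self, trainer):
        trainer.events.append("wrap")

    def after_eval(self, trainer):
        trainer.events.append("unwrap")


class MiniTrainer:
    """Toy trainer: hooks are registered in __init__, then after_init fires."""

    def __init__(self, hooks):
        self.hooks = list(hooks)
        self.parallel_groups = {}
        self.events = []
        # Hooks already exist at this point, so the after_init stage can run.
        self.invoke("after_init")

    def invoke(self, stage):
        # Call the named stage on every hook, in registration order.
        for hook in self.hooks:
            getattr(hook, stage)(self)

    def evaluate(self):
        self.invoke("before_eval")
        self.events.append("eval")
        self.invoke("after_eval")
        return self.events
```

Because stages run in hook registration order, the order of two wrapping hooks (for example an amp hook and a DDP hook) would be fixed by the order in which they are registered in this sketch.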