12 Commits

Author SHA1 Message Date
yuze.zyz
a0bc5549a1 trainer supports parallel_groups
Design doc: https://yuque.alibaba-inc.com/suluyan.sly/yh1rvu/yx0owblyebpa2b3l?singleDoc#flU3s

1. Add parallel_group field in trainer to support DP, TP, PP.
2. Move the construction of common hooks (except the optimizer/lr scheduler hooks) to the trainer's init method to support the after_init stage.
    The after_init stage supports the initialization of DP, TP, and PP.
    https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48099986
3. Add before_eval/after_eval stages to support model wrapping.
    This solves the ordering problem of apex amp and DDP wrapping.
    https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48099986
4. Exporter supports lazy importing.
    https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48122780
5. Fold all megatron imports into the megatron hook.
    https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48099986
6. Add compile method to TorchModel, Pipeline, and Trainer to support torch 2.0
    https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=46869415
7. Fix bug: Lrscheduler builder does not support torch 2.0
8. Add callbacks for trainer
    https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48210342

Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11849932
2023-03-09 21:33:35 +08:00
yuze.zyz
4dca4773db Support csanmt exporting and refactor some code
1. Support csanmt exporting to SavedModel format
2. Create a new base class for text-ranking preprocessors, and move some parameters of mgeo_ranking_preprocessor to the init method
3. Decouple the Model and Preprocessor classes from PyTorch
4. Regression test supports comparing only model output
5. Support zero-shot exporting to ONNX and TorchScript

Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11522461
2023-02-10 05:15:04 +00:00
wenmeng.zwm
c8dcdd93da broadcast metric values across all workers for distribution
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10980488
2022-12-08 10:22:47 +08:00
yingda.chen
4e4faa9a30 specify file encoding when opening text files for reading
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10708723
2022-11-14 14:16:08 +08:00
jiangnana.jnn
1794e08af7 fix dist training
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10185634
2022-09-21 17:47:50 +08:00
jiangnana.jnn
5e176da3a1 adapt to msdataset for EasyCV
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9935664
2022-09-09 10:01:51 +08:00
zhangzhicheng.zzc
b94bb74f66 [to #42322933] Add model.save_pretrained method and allow fine-tuned results to be used by pipeline
2022-08-24 21:39:08 +08:00
jiangnana.jnn
cfc3d1eed7 fix iters_per_epoch handling in trainer
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9791200
2022-08-17 20:06:25 +08:00
jiangnana.jnn
6f5b864735 [to #43850241] fix unittest
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9660779
2022-08-05 18:39:59 +08:00
zhangzhicheng.zzc
9d0b38b4e4 [to #42322933] lazy load on trainer
2022-08-04 14:07:14 +08:00
jiangnana.jnn
34840fc5d8 [to #43627720] support ReduceLROnPlateau and fix lr scheduler
1. Support the `ReduceLROnPlateau` lr scheduler, and add `PlateauLrSchedulerHook` for it
2. Support custom `optimizer_hook` and `lr_scheduler_hook`
3. Remove best-checkpoint saving from `EvaluationHook`; `BestCkptSaverHook` replaces it
4. `evaluation_loop` returns metric values directly; metric computation moves to `single_gpu_test` and `multi_gpu_test`

Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9584322
2022-08-02 14:49:48 +08:00
jiangnana.jnn
21437650f1 [to #43627720] support distributed training
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9551089
2022-07-28 17:43:23 +08:00