Commit Graph

10 Commits

Author SHA1 Message Date
hemu.zp
2b1af959d5 Convert cfg during training
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11900238
2023-03-09 22:27:44 +08:00
yuze.zyz
a0bc5549a1 trainer support parallel_groups
Design doc: https://yuque.alibaba-inc.com/suluyan.sly/yh1rvu/yx0owblyebpa2b3l?singleDoc#flU3s

1. Add a parallel_group field to the trainer to support DP, TP, and PP.
2. Move the construction of common hooks (except the optimizer/lrscheduler hooks) to the trainer's init method to support the after_init stage.
    The after_init stage supports the initialization of DP, TP, and PP.
    https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48099986
3. Add before_eval/after_eval stages to support model wrapping.
    This solves the ordering problem between apex amp and DDP wrapping.
    https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48099986
4. Exporter supports lazy importing.
    https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48122780
5. Fold all megatron imports into the megatron hook.
    https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48099986
6. Add a compile method to TorchModel, Pipeline, and Trainer to support torch2.0.
    https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=46869415
7. Fix bug: the lrscheduler builder does not support torch2.0.
8. Add callbacks for the trainer.
    https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48210342
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11849932
2023-03-09 21:33:35 +08:00
yuze.zyz
7181e667f6 Refactor hooks
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11651547
2023-02-28 13:51:01 +08:00
zhangzhicheng.zzc
b94bb74f66 [to #42322933] Add model.save_pretrained method and allow finetune results to be used by pipeline
2022-08-24 21:39:08 +08:00
jiangnana.jnn
cfc3d1eed7 fix trainer about iters_per_epoch
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9791200

    * fix trainer about iters_per_epoch
2022-08-17 20:06:25 +08:00
jiangnana.jnn
76482cc3ea [to #43850241] fix processor and collate_fn
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9644184

    * fix distributed training and eval
2022-08-16 12:04:07 +08:00
zhangzhicheng.zzc
9d0b38b4e4 [to #42322933] lazy load on trainer
2022-08-04 14:07:14 +08:00
feiwu.yfw
2c3875c0e1 [to #43299989] Fix msdataset
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9436292

    * fix msdataset
2022-07-20 16:38:15 +08:00
jiangnana.jnn
f3d739bea7 [to #43105545] add default config and new hooks
2022-07-19 17:41:25 +08:00
wenmeng.zwm
231f400133 [to #43112534] finetune support and first case
co-contributed with 夕陌&雨泓

 * add torch epoch based trainer and dis utils
 * add hooks including optimizer, lrscheduler, logging, checkpoint, evaluation, time profiling
 * add torch model base and test
 * add optimizer and lrscheduler module
 * add sbert for text classification example
 * add task_dataset for dataset-level processor

Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9338412
2022-07-14 16:25:55 +08:00