xingjun.wang
|
48c0d2a9af
|
add 1.6
|
2023-05-22 10:53:18 +08:00 |
|
yuze.zyz
|
a0bc5549a1
|
trainer support parallel_groups
Design doc: https://yuque.alibaba-inc.com/suluyan.sly/yh1rvu/yx0owblyebpa2b3l?singleDoc#flU3s
1. Add parallel_group field in trainer to support DP, TP, PP.
2. Move the construction of common hooks (except the optimizer/lrscheduler hooks) to the trainer's init method to support the after_init stage.
The after_init stage supports initialization for DP, TP, and PP.
https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48099986
3. Add before_eval/after_eval stages to support model wrapping.
This solves the ordering problem between apex amp and DDP wrapping.
https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48099986
4. Exporter supports lazy importing.
https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48122780
5. Fold all megatron imports into the megatron hook.
https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48099986
6. Add compile method to TorchModel, Pipeline, and Trainer to support torch2.0
https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=46869415
7. Fix bug: Lrscheduler builder does not support torch2.0
8. Add callbacks for trainer
https://aone.alibaba-inc.com/v2/workitem#viewIdentifier=1c46ee8637e0c978f115b6f7&openWorkitemIdentifier=48210342
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11849932
|
2023-03-09 21:33:35 +08:00 |
|
yuze.zyz
|
7181e667f6
|
Refactor hooks
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11651547
|
2023-02-28 13:51:01 +08:00 |
|
zhangzhicheng.zzc
|
b94bb74f66
|
[to #42322933] Add model.save_pretrained method and allow finetune results to be used by pipeline
|
2022-08-24 21:39:08 +08:00 |
|
jiangnana.jnn
|
76482cc3ea
|
[to #43850241] fix processor and collate_fn
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9644184
* fix distributed training and eval
|
2022-08-16 12:04:07 +08:00 |
|
zhangzhicheng.zzc
|
9d0b38b4e4
|
[to #42322933] lazy load on trainer
|
2022-08-04 14:07:14 +08:00 |
|
wenmeng.zwm
|
4814b198f0
|
[to #43112534] taskdataset refine and auto placement for data and model
* refine taskdataset interface
* add device placement for trainer
* add device placement for pipeline
* add config checker and fix model placement bug
* fix cyclic import
* refactor model init for translation_pipeline
* cv pipelines support kwargs
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9463076
|
2022-07-23 11:08:43 +08:00 |
|
feiwu.yfw
|
2c3875c0e1
|
[to #43299989] Fix msdataset
* fix msdataset
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9436292
|
2022-07-20 16:38:15 +08:00 |
|
jiangnana.jnn
|
f3d739bea7
|
[to #43105545] add default config and new hooks
|
2022-07-19 17:41:25 +08:00 |
|