35 Commits

Author SHA1 Message Date
xingjun.wang
48c0d2a9af add 1.6 2023-05-22 10:53:18 +08:00
yuze.zyz
bb5512d1ab [to #42322933] Refactor NLP and fix some user feedbacks
1. Abstract keys of dicts needed by nlp metric classes into the init method
2. Add Preprocessor.save_pretrained to save preprocessor information
3. Abstract the config-saving function so that the config is saved properly both after a direct call to from_pretrained and when the cfg is modified item by item during training.
4. Remove SbertTokenizer and VecoTokenizer, use transformers' tokenizers instead
5. Use model/preprocessor's from_pretrained in all nlp pipeline classes.
6. Add model_kwargs and preprocessor_kwargs in all nlp pipeline classes
7. Add base classes for fill-mask and text-classification preprocessor, as a demo for later changes
8. Fix user feedback: Re-train the model in continue training scenario
9. Fix user feedback: Too many checkpoints saved
10. Simplify the nlp-trainer
11. Fix user feedback: Split the default trainer's __init__ method, which makes it easier for users to override
12. Add safe_get to Config class

----------------------------  Another refactor from version 36 -------------------------

13. Rename all nlp transformers' preprocessors from TaskNamePreprocessor to TaskNameTransformersPreprocessor, for example:
      TextClassificationPreprocessor -> TextClassificationTransformersPreprocessor
14. Add a base class per task for all nlp tasks' preprocessors that have at least two sub-preprocessors
15. Add output classes of nlp models
16. Refactor the logic for token-classification
17. Fix bug: checkpoint_hook does not support pytorch_model.pt
18. Fix bug: Pipeline name does not match with task name, so inference will not succeed after training
       NOTE: This is just a stop-gap fix; the root cause is the unclear relationship between models and pipelines
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10723513
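
The safe_get helper mentioned in item 12 can be pictured as a nested lookup that returns a default instead of raising. The sketch below is only an illustration under assumptions: the free function `safe_get`, its dotted-key argument, and the plain-dict config are hypothetical and may differ from the actual Config.safe_get in this commit.

```python
from typing import Any


def safe_get(cfg: dict, dotted_key: str, default: Any = None) -> Any:
    """Return cfg[a][b][c] for the key 'a.b.c', or `default` if any level is missing.

    Hypothetical helper for illustration; the real Config.safe_get may differ.
    """
    node: Any = cfg
    for part in dotted_key.split('.'):
        # Stop as soon as a level is not a dict or lacks the requested key.
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


cfg = {'train': {'optimizer': {'lr': 2e-5}}}
lr = safe_get(cfg, 'train.optimizer.lr')
missing = safe_get(cfg, 'train.lr_scheduler.type', default='none')
```

A lookup of this kind lets calling code read optional config sections without a chain of `in` checks.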

    * add save_pretrained to preprocessor

* save preprocessor config in hook

* refactor label-id mapping fetching logic

* test ok on sentence-similarity

* run on finetuning

* fix bug

* pre-commit passed

* fix bug

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/preprocessors/nlp/nlp_base.py

* add params to init

* 1. support max ckpt num 2. support ignoring others but bin file in continue training 3. add arguments to some nlp metrics

* Split trainer init impls to overridable methods

* remove some obsolete tokenizers

* unfinished

* support input params in pipeline

* fix bugs

* fix ut bug

* fix bug

* fix ut bug

* fix ut bug

* fix ut bug

* add base class for some preprocessors

* Merge commit '379867739548f394d0fa349ba07afe04adf4c8b6' into feat/refactor_config

* compatible with old code

* fix ut bug

* fix ut bugs

* fix bug

* add some comments

* fix ut bug

* add a requirement

* fix pre-commit

* Merge commit '0451b3d3cb2bebfef92ec2c227b2a3dd8d01dc6a' into feat/refactor_config

* fixbug

* Support function type in registry

* fix ut bug

* fix bug

* Merge commit '5f719e542b963f0d35457e5359df879a5eb80b82' into feat/refactor_config

# Conflicts:
#	modelscope/pipelines/nlp/multilingual_word_segmentation_pipeline.py
#	modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
#	modelscope/pipelines/nlp/word_segmentation_pipeline.py
#	modelscope/utils/hub.py

* remove obsolete file

* rename init args

* rename params

* fix merge bug

* add default preprocessor config for ner-model

* move a method to a util file

* remove unused config

* Fix a bug in pbar

* bestckptsaver:change default ckpt numbers to 1

* 1. Add assert to max_epoch 2. split init_dist and get_device 3. change cmp func name

* Fix bug

* fix bug

* fix bug

* unfinished refactoring

* unfinished

* uw

* uw

* uw

* uw

* Merge branch 'feat/refactor_config' into feat/refactor_trainer

# Conflicts:
#	modelscope/preprocessors/nlp/document_segmentation_preprocessor.py
#	modelscope/preprocessors/nlp/faq_question_answering_preprocessor.py
#	modelscope/preprocessors/nlp/relation_extraction_preprocessor.py
#	modelscope/preprocessors/nlp/text_generation_preprocessor.py

* uw

* uw

* unify nlp task outputs

* uw

* uw

* uw

* uw

* change the order of text cls pipeline

* refactor t5

* refactor tg task preprocessor

* fix

* unfinished

* temp

* refactor code

* unfinished

* unfinished

* unfinished

* unfinished

* uw

* Merge branch 'feat/refactor_config' into feat/refactor_trainer

* smoke test pass

* ut testing

* pre-commit passed

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/models/nlp/bert/document_segmentation.py
#	modelscope/pipelines/nlp/__init__.py
#	modelscope/pipelines/nlp/document_segmentation_pipeline.py

* merge master

* unfinished

* Merge branch 'feat/fix_bug_pipeline_name' into feat/refactor_config

* fix bug

* fix ut bug

* support ner batch inference

* fix ut bug

* fix bug

* support batch inference on three nlp tasks

* unfinished

* fix bug

* fix bug

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/models/base/base_model.py
#	modelscope/pipelines/nlp/conversational_text_to_sql_pipeline.py
#	modelscope/pipelines/nlp/dialog_intent_prediction_pipeline.py
#	modelscope/pipelines/nlp/dialog_modeling_pipeline.py
#	modelscope/pipelines/nlp/dialog_state_tracking_pipeline.py
#	modelscope/pipelines/nlp/document_segmentation_pipeline.py
#	modelscope/pipelines/nlp/faq_question_answering_pipeline.py
#	modelscope/pipelines/nlp/feature_extraction_pipeline.py
#	modelscope/pipelines/nlp/fill_mask_pipeline.py
#	modelscope/pipelines/nlp/information_extraction_pipeline.py
#	modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
#	modelscope/pipelines/nlp/sentence_embedding_pipeline.py
#	modelscope/pipelines/nlp/summarization_pipeline.py
#	modelscope/pipelines/nlp/table_question_answering_pipeline.py
#	modelscope/pipelines/nlp/text2text_generation_pipeline.py
#	modelscope/pipelines/nlp/text_classification_pipeline.py
#	modelscope/pipelines/nlp/text_error_correction_pipeline.py
#	modelscope/pipelines/nlp/text_generation_pipeline.py
#	modelscope/pipelines/nlp/text_ranking_pipeline.py
#	modelscope/pipelines/nlp/token_classification_pipeline.py
#	modelscope/pipelines/nlp/word_segmentation_pipeline.py
#	modelscope/pipelines/nlp/zero_shot_classification_pipeline.py
#	modelscope/trainers/nlp_trainer.py

* pre-commit passed

* fix bug

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/preprocessors/__init__.py

* fix bug

* fix bug

* fix bug

* fix bug

* fix bug

* fixbug

* pre-commit passed

* fix bug

* fixbug

* fix bug

* fix bug

* fix bug

* fix bug

* self review done

* fixbug

* fix bug

* fix bug

* fix bugs

* remove sub-token offset mapping

* fix name bug

* add some tests

* 1. support batch inference of text-generation,text2text-generation,token-classification,text-classification 2. add corresponding UTs

* add old logic back

* tmp save

* add tokenize by words logic back

* move outputs file back

* revert veco token-classification back

* fix typo

* Fix description

* Merge commit '4dd99b8f6e4e7aefe047c68a1bedd95d3ec596d6' into feat/refactor_config

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/pipelines/builder.py
2022-11-30 23:52:17 +08:00
yuze.zyz
605cd7f44a [to #42322933] NLP 1030 Refactor
Features:
1. Refactor the directory structure of nlp models. All model files are placed into either the model folder or the task_model folder
2. Refactor all the comments to google style
3. Add detailed comments to important tasks and nlp models, describing each model and its preprocessor and trainer
4. Model exporting now supports a direct call to TorchModelExporter (no need to derive from it)
5. Refactor the model save_pretrained method to support running directly (independently of the trainer)
6. Remove the Model type check in the pipeline base class, so that externally registered models can run in our pipelines
7. The nlp trainer now has an NLPTrainingArguments class; users can pass arguments into the dataclass and use it as a normal cfg_modify_fn, which simplifies modifying the cfg.
8. Merge BACKBONES and MODELS, so users can get a backbone with the Model.from_pretrained call
9. Model.from_pretrained now supports a task argument, so users can take a backbone and load it with a specific task class.
10. Support Preprocessor.from_pretrained method
11. Add standard return classes to important nlp tasks, so that some pipelines and models are now independent of each other: models always return tensors, and the pipelines take care of the conversion to numpy and any further post-processing.
12. Split the nlp preprocessors file, to make the directory structure clearer.

Bug fixes:
1. Fix a bug where lr_scheduler could be called before the optimizer's step
2. Fix a bug where calling Pipelines directly (not via pipeline(xxx)) throws an error
3. Fix a bug where the trainer does not call the correct TaskDataset class
4. Fix a bug where the internal loading of the dataset throws an error in the trainer class
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10490585
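
The idea in feature 7, a dataclass of training arguments that doubles as a cfg_modify_fn, might look roughly like the following. This is a sketch under assumptions, not the NLPTrainingArguments class itself: the class name `TrainingArgs`, its two fields, and the dict-shaped cfg with a `train` section are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TrainingArgs:
    """Hypothetical stand-in for a training-arguments dataclass (names assumed)."""
    max_epochs: Optional[int] = None
    lr: Optional[float] = None

    def __call__(self, cfg: dict) -> dict:
        # Behaves like a cfg_modify_fn: copy every field that was set into cfg,
        # leaving values the user did not override untouched.
        train = cfg.setdefault('train', {})
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None:
                train[f.name] = value
        return cfg


cfg = {'train': {'max_epochs': 1, 'lr': 1e-3}}
cfg = TrainingArgs(max_epochs=5)(cfg)
```

Because the instance is callable, it can be handed to anything that expects a function taking and returning a cfg.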
2022-10-25 12:26:25 +08:00
yuze.zyz
707cbef013 [to #42322933]Fix bug in daily UT
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10491891
2022-10-22 23:25:18 +08:00
zhangzhicheng.zzc
d721fabb34 [to #42322933]bert with sequence classification / token classification/ fill mask refactor
1. Add support for the original bert model (not easynlp's backbone-prefix version)
2. Support the backbone-head form of bert for sequence classification / fill mask / token classification
3. Unify the pipelines of the several sequence classification tasks into one class
4. fill mask supports the backbone-head form
5. Unify the preprocessors of the token classification sub-tasks (ner, word seg, part of speech) into TokenClassificationPreprocessor
6. Unify the preprocessors of the sequence classification sub-tasks (single classification, pair classification) into SequenceClassificationPreprocessor
7. Change where the group_key of cls is assigned in register; previously, with multiple decorators, the group_key was overwritten, so the group_key information of obj_cls was incorrect
8. Based on the backbone-head form, try to adjust the cases where group_key and module had the same name, for example in modelscope/pipelines/nlp/sequence_classification_pipeline.py
Before:
@PIPELINES.register_module(
    Tasks.sentiment_classification, module_name=Pipelines.sentiment_classification)
After:
@PIPELINES.register_module(
    Tasks.text_classification, module_name=Pipelines.sentiment_classification)
The corresponding configuration.json is changed as well; this better reflects the relationship between a task and its pipelines (sub-tasks).
9. Other related changes to support the features above
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10041463
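
The group_key problem fixed in this commit (a later decorator overwriting the group recorded by an earlier one) can be avoided by keeping the group inside the registry rather than as a single attribute on the class. The sketch below is an assumption-laden illustration, not the actual modelscope registry: the `Registry` class, its `get` method, and the string task and pipeline names are hypothetical.

```python
class Registry:
    """Minimal registry keyed by (group_key, module_name); illustrative only."""

    def __init__(self) -> None:
        self._modules: dict = {}

    def register_module(self, group_key: str, module_name: str):
        def decorator(cls):
            # Store the group in the registry, not as one attribute on cls,
            # so a second decorator cannot overwrite the first registration.
            self._modules.setdefault(group_key, {})[module_name] = cls
            return cls
        return decorator

    def get(self, group_key: str, module_name: str):
        return self._modules[group_key][module_name]


PIPELINES = Registry()


# One task (group) with two pipelines (sub-tasks) served by the same class.
@PIPELINES.register_module('text-classification', module_name='sentiment-classification')
@PIPELINES.register_module('text-classification', module_name='nli')
class SequenceClassificationPipeline:
    pass
```

Both registrations survive stacking, which matches the task-to-pipeline (sub-task) relationship the commit message describes.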
2022-09-27 23:08:33 +08:00
Yingda Chen
e0ef60ca9b [to #42322933] skip demo test by default 2022-09-09 14:56:33 +08:00
lingcai.wl
7a49fa1cc6 [to #44657982] add unittest for demo and demotest utils
unittest for demo service
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10006180
2022-09-08 14:08:51 +08:00
yingda.chen
45620dbc7f [to #42322933]clean up test level
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9662182

    * clean up test level
2022-08-06 12:22:17 +08:00
wenmeng.zwm
d55525bfb6 [to #43112771] requirements check and lazy import support 2022-07-27 17:29:16 +08:00
wenmeng.zwm
4814b198f0 [to #43112534] taskdataset refine and auto placement for data and model
* refine taskdataset interface
 * add device placement for trainer
 * add device placement for pipeline
 * add config checker and fix model placement bug
 * fix cycling import
 * refactor model init for translation_pipeline
 * cv pipelines support kwargs


Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9463076
2022-07-23 11:08:43 +08:00
feiwu.yfw
2c3875c0e1 [to #43299989] Fix msdataset
* fix msdataset
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9436292

    * fix msdataset
2022-07-20 16:38:15 +08:00
feiwu.yfw
5da470fd5d [to #42791465, #42779255, #42777959, #42757844, #42756050, #42746916, #42743595, #42791863] fix: fix msdataset
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9174075

* fix msdataset
2022-06-28 20:40:57 +08:00
yingda.chen
6702b29e21 [to #42794773]rename pydataset to msdataset
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9165402
2022-06-27 11:09:38 +08:00
Yingda Chen
b6e3fd80b0 Revert "[to #42794773] rename pydataset to msdataset"
This reverts commit c8e2e6de0e.
2022-06-25 08:50:28 +08:00
Yingda Chen
c8e2e6de0e [to #42794773] rename pydataset to msdataset 2022-06-25 08:36:48 +08:00
yingda.chen
e7571a566f [to #42322933] skip dataset test for now
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9155405
2022-06-24 11:47:28 +08:00
yingda.chen
1a0d4af55a [to #42322933] test level check
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9143809
2022-06-23 16:55:48 +08:00
wenmeng.zwm
e288cf076e [to #42362853] refactor pipeline and standardize module_name
* using get_model to validate hub path 
* support reading pipeline info from configuration file
* add metainfo const
* update model type and pipeline type and fix UT
* relax requirement for protobuf
* skip two dataset tests due to temporary failure
 
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9118154
2022-06-22 14:15:32 +08:00
mulin.lyh
76c6ff6329 [to #42675838]merge model hub code
Merge model hub code
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9096493
2022-06-21 20:04:25 +08:00
feiwu.yfw
c7238a470b [to #42670107]pydataset fetch data from datahub
* pydataset fetch data from datahub
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9060856
2022-06-21 11:10:28 +08:00
wenmeng.zwm
c59833c7ee [to #42461396] feat: test_level support
* add test level support
* update develop doc
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9021354
2022-06-15 14:53:49 +08:00
yingda.chen
b31c86aa0e [to #42409340] add hub specifier
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9005038
2022-06-13 14:15:54 +08:00
wenmeng.zwm
1f6b376599 [to #42373878] refactor maaslib to modelscope
1.  refactor maaslib to modelscope
2.  fix UT error
3.  support pipeline which does not register default model

Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8988388
2022-06-09 20:16:26 +08:00
wenmeng.zwm
dd00195814 [to #42362853] add default model support and fix circular import
1. add default model support
2. fix circular import
3. temporarily skip ofa and palm test which costs too much time

Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8981076
2022-06-09 16:57:33 +08:00
yingda.chen
0d840d519c [to #42339763] move pydataset into maas_lib
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8974892
2022-06-09 10:14:48 +08:00
yingda.chen
e3b8ec3bf1 [to #42339559] support multiple models
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8972440

    * [to #42339559] support multiple models
2022-06-08 21:27:14 +08:00
feiwu.yfw
235880f300 [to #42339763] merge pydataset into maas-lib
* merge pydataset to the repo
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8955999
2022-06-08 18:29:39 +08:00
yingda.chen
d6868ddffe [to #42323743] retain local cached model files by default
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8963687
2022-06-08 14:22:23 +08:00
yingda.chen
e075ad2245 [to #42322515]support plain pipeline for bert
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8945177

    * support plain pipeline for bert
2022-06-08 11:29:25 +08:00
yingda.chen
f8eb699f7f refine tests and examples
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8898823
2022-06-01 10:20:53 +08:00
wenmeng.zwm
1d01a78c2b fix: UT error
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8899458

    * fix: UT error
2022-06-01 09:16:39 +08:00
yingda.chen
5995cc4607 add PyDataset support
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8868644
2022-05-31 18:27:19 +08:00
wenmeng.zwm
25a2028b54 [to #41401401] modelhub and Trainer support
* add trainer interface
 * add trainer script
 * add model init support for pipeline
 * add pipeline tutorial and fix bugs
 * add text classification evaluation to maas lib 
 * add quickstart and prepare env doc
 * relax requirements for torch and sentencepiece
 * merge release/0.1 and fix conflict
 * modelhub support for model and pipeline

 Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8868339
2022-05-30 11:53:53 +08:00
wenmeng.zwm
cb416edc2a [to #41669377] add pipeline tutorial and fix bugs
1. add pipeline tutorial
2. fix bugs when using pipeline with certain model and preprocessor

Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8810524
2022-05-24 17:14:58 +08:00
wenmeng.zwm
5e469008fd [to #41401401] add preprocessor, model and pipeline
* add preprocessor module
 * add model base and builder
 * update task constant
 * add load image preprocessor and its dependency
 * add pipeline interface and UT covered
 * support default pipeline for task
 * add image matting pipeline
 * refine nlp tokenize interface
 * add nlp pipeline 
 * fix UT failed
 * add test for Compose

Link: https://code.aone.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8769235

* add preprocessor module

* add test for Compose

* fix citest error

* fix abs class error

* add model base and builder

* update task constant

* add load image preprocessor and its dependency

* add pipeline interface and UT covered

* support default pipeline for task

* refine models and pipeline interface

* add pipeline folder structure

* add image matting pipeline

* refine nlp tokenize interface

* add nlp pipeline 

1. add preprocessor, model and pipeline for nlp text classification
2. add corresponding test

Link: https://code.aone.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/8757371

* new nlp pipeline

* format pre-commit code

* update easynlp pipeline

* update model_name for easynlp pipeline; add test for maas_lib/utils/typeassert.py

* update test_typeassert.py

* refactor code

1. rename typeassert to type_assert
2. use lazy import to make easynlp dependency optional
3. refine image matting UT
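
Item 2 above makes the easynlp dependency optional through lazy import. One common way to do that is a small proxy that imports the module only on first attribute access. The following is a sketch under assumptions, not the code from this commit: the `LazyModule` class and the package name `not_installed_pkg` are hypothetical, and the standard-library `json` module stands in for the optional dependency.

```python
import importlib
from types import ModuleType
from typing import Optional


class LazyModule:
    """Defer importing `name` until an attribute is first accessed (illustrative)."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._module: Optional[ModuleType] = None

    def __getattr__(self, attr: str):
        # Only reached for attributes not found on the proxy itself,
        # i.e. the ones that should come from the wrapped module.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


json_lazy = LazyModule('json')                  # import is deferred
optional_dep = LazyModule('not_installed_pkg')  # no error until first use
encoded = json_lazy.dumps({'ok': True})
```

With this pattern a missing optional package only raises ImportError when code actually touches it, so the rest of the library stays importable.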

* fix linter test failed

* update requirements.txt

* fix UT failed

* fix citest script to update requirements
2022-05-19 22:18:35 +08:00