* del _datasets_server import in hf_dataset_util
* fix streaming for youku-mplug and adopt latest datasets
* fix download config copy
* update ut
* add youku in test_general_datasets
* update UT for general dataset
* adapt to datasets version: 2.19.0 or later
* add assert for youku data UT
* fix disable_tqdm in some functions for 2.19.0 or later
* update get_module_with_script
* set trust_remote_code to True in load_dataset_with_ctx
* update print info
* update requirements for datasets version restriction
* fix _dataset_info
* add pillow
* update comments
* update comment
* reuse _download function in DataDownloadManager
* remove unused code
* update test_run_modelhub in Human3DAnimationTest
* set datasets>=2.18.0
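The version-related entries above (adapting to datasets 2.19.0 or later, then pinning datasets>=2.18.0) suggest some form of version gate in the loader code. A minimal sketch of such a gate, using only the standard library; the helper names parse_version and at_least are hypothetical and are not taken from modelscope:

```python
import re


def parse_version(version: str) -> tuple:
    """Return the leading numeric components of a version string as ints.

    '2.19.0.dev0' becomes (2, 19, 0); pre-release suffixes are ignored,
    which is a simplification of real version ordering.
    """
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def at_least(installed: str, required: str) -> bool:
    """True if `installed` is the same as or newer than `required`."""
    a, b = parse_version(installed), parse_version(required)
    # Pad the shorter tuple with zeros so that '2.18' equals '2.18.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

With the datasets package installed, a call such as at_least(datasets.__version__, '2.18.0') could guard code paths that only exist in newer releases.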
1. Support ODPS datasource for virgo sdk.
2. Adapt "inner_url" parser for single-modal and multi-modal datasets.
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/12583695
* add virgo sdk odps data
* refine virgo sdk args and odps fetch batch data pipeline
* add inner import for odps
* del import VirgoDataset in MsDataset
* fix import VirgoDataset
* support inner url downloading
* refine dataset.py and maxcompute_utils.py
* add ut for virgo odps data batch
* reset unifold ut level as 1
* refine virgo batch data
Refactor the task_datasets module:
1. Add new module modelscope.msdatasets.dataset_cls.custom_datasets.
2. Add new function modelscope.msdatasets.ms_dataset.MsDataset.to_custom_dataset(), and call it in MsDataset.load() to adapt to the new custom_datasets module.
3. Refactor the pipeline for loading custom dataset:
1) Only use MsDataset.load() function to load the custom datasets.
2) Combine MsDataset.load() with class EpochBasedTrainer.
4. Add new entry func for building datasets in EpochBasedTrainer: see modelscope.trainers.trainer.EpochBasedTrainer.build_dataset()
5. Add new func to build the custom dataset from model configuration, see: modelscope.trainers.trainer.EpochBasedTrainer.build_dataset_from_cfg()
6. Add new registry function for building custom datasets, see: modelscope.msdatasets.dataset_cls.custom_datasets.builder.build_custom_dataset()
7. Refine the class SiameseUIETrainer to adapt the new custom_datasets module.
8. Add class TorchCustomDataset as a superclass for custom datasets classes.
9. Move the following modules/classes/functions:
1) Move module msdatasets.audio to custom_datasets
2) Move module msdatasets.cv to custom_datasets
3) Move module bad_image_detecting to custom_datasets
4) Move module damoyolo to custom_datasets
5) Move module face_2d_keypoints to custom_datasets
6) Move module hand_2d_keypoints to custom_datasets
7) Move module human_wholebody_keypoint to custom_datasets
8) Move module image_classification to custom_datasets
9) Move module image_inpainting to custom_datasets
10) Move module image_portrait_enhancement to custom_datasets
11) Move module image_quality_assessment_degradation to custom_datasets
12) Move module image_quality_assmessment_mos to custom_datasets
13) Move class LanguageGuidedVideoSummarizationDataset to custom_datasets
14) Move class MGeoRankingDataset to custom_datasets
15) Move module movie_scene_segmentation to custom_datasets
16) Move module object_detection to custom_datasets
17) Move module referring_video_object_segmentation to custom_datasets
18) Move module sidd_image_denoising to custom_datasets
19) Move module video_frame_interpolation to custom_datasets
20) Move module video_stabilization to custom_datasets
21) Move module video_super_resolution to custom_datasets
22) Move class GoproImageDeblurringDataset to custom_datasets
23) Move class EasyCVBaseDataset to custom_datasets
24) Move class ImageInstanceSegmentationCocoDataset to custom_datasets
25) Move class RedsImageDeblurringDataset to custom_datasets
26) Move class TextRankingDataset to custom_datasets
27) Move class VecoDataset to custom_datasets
28) Move class VideoSummarizationDataset to custom_datasets
10. Delete the following modules/functions/classes:
1) Del module task_datasets
2) Del to_task_dataset() in EpochBasedTrainer
3) Del the old build_dataset() in EpochBasedTrainer and add a new function with the same name.
11. Rename class Datasets to CustomDatasets in metainfo.py
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11872747
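The registry-based construction mentioned in the refactor notes above (a build function for custom datasets plus a common superclass) can be pictured with a small self-contained sketch. Everything in it is hypothetical: CUSTOM_DATASETS, register_custom_dataset, BaseCustomDataset and the simplified build_custom_dataset signature only illustrate the pattern and are not the actual modelscope implementation.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: maps a dataset type name to the class that builds it.
CUSTOM_DATASETS: Dict[str, Callable[..., Any]] = {}


def register_custom_dataset(name: str):
    """Class decorator that records a dataset class under `name`."""
    def decorator(cls):
        if name in CUSTOM_DATASETS:
            raise KeyError(f'{name} is already registered')
        CUSTOM_DATASETS[name] = cls
        return cls
    return decorator


class BaseCustomDataset:
    """Stand-in for a shared superclass such as TorchCustomDataset."""

    def __init__(self, samples):
        self.samples = list(samples)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]


@register_custom_dataset('text-ranking')
class ToyTextRankingDataset(BaseCustomDataset):
    """Toy dataset that turns (query, passage) pairs into dicts."""

    def __getitem__(self, index):
        query, passage = self.samples[index]
        return {'query': query, 'passage': passage}


def build_custom_dataset(cfg: dict, **default_args):
    """Look up cfg['type'] in the registry and instantiate it with the other keys."""
    kwargs = {**default_args, **cfg}
    dataset_type = kwargs.pop('type')
    if dataset_type not in CUSTOM_DATASETS:
        raise KeyError(f'unknown custom dataset type: {dataset_type}')
    return CUSTOM_DATASETS[dataset_type](**kwargs)


dataset = build_custom_dataset({'type': 'text-ranking'},
                               samples=[('q1', 'p1'), ('q2', 'p2')])
```

In this sketch a trainer would only need the type name from the model configuration; the registry hides which concrete class is instantiated.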
1. Fix the conflict in MsDataset.load() between a local path and a remote dataset name given as dataset_name='namespace/dataset_name'.
2. Change the prefix passed to obj_key.startswith in get_split_objects_map to handle directory names of the form 'xxx/'.
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11820290
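The ambiguity behind the first fix is that a string such as 'namespace/dataset_name' can be read either as a relative local path or as a remote dataset identifier. One way to resolve it is sketched below, assuming that an existing local path takes precedence; the function resolve_dataset_name, its return shape and the default namespace value are hypothetical and may differ from what MsDataset.load() actually does.

```python
import os
import tempfile
from typing import Optional, Tuple


def resolve_dataset_name(
        dataset_name: str,
        default_namespace: str = 'modelscope',  # assumed default, for illustration
) -> Tuple[str, Optional[str], str]:
    """Return (kind, namespace, name) where kind is 'local' or 'remote'."""
    expanded = os.path.expanduser(dataset_name)
    # Assumption: an existing file or directory wins over 'namespace/name'.
    if os.path.exists(expanded):
        return 'local', None, os.path.abspath(expanded)
    parts = dataset_name.split('/')
    if len(parts) == 2 and all(parts):
        return 'remote', parts[0], parts[1]
    if len(parts) == 1 and parts[0]:
        return 'remote', default_namespace, parts[0]
    raise ValueError(f'cannot interpret dataset_name: {dataset_name!r}')


# A real file is treated as local; a non-existent 'a/b' string as remote.
with tempfile.NamedTemporaryFile(suffix='.csv') as handle:
    local_result = resolve_dataset_name(handle.name)
remote_result = resolve_dataset_name('some_namespace_xyz/some_dataset_xyz')
```

The order of the checks is the design choice here: testing for an existing path first keeps local files loadable even when their relative path happens to look like 'namespace/dataset_name'.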
* fix the conflict between a local path and namespace/dataset_name in the dataset_name argument
* fix function: get_split_objects_map
* add UT for loading local csv file
* add new test case for test_load_local_csv function
1. support local file paths such as '/to/path/abc.csv' in MsDataset.load() function
2. fix a compatibility issue with the datasets library
3. modify the resumable cache path for oss utils
4. add UT cases
1. Move the dict keys needed by nlp metric classes into arguments of the init method
2. Add Preprocessor.save_pretrained to save preprocessor information
3. Abstract the config-saving function so that the config is saved correctly both when from_pretrained is called directly and when the cfg is modified step by step during training.
4. Remove SbertTokenizer and VecoTokenizer, use transformers' tokenizers instead
5. Use model/preprocessor's from_pretrained in all nlp pipeline classes.
6. Add model_kwargs and preprocessor_kwargs in all nlp pipeline classes
7. Add base classes for fill-mask and text-classification preprocessor, as a demo for later changes
8. Fix user feedback: Re-train the model in continue training scenario
9. Fix user feedback: Too many checkpoints saved
10. Simplify the nlp-trainer
11. Fix user feedback: Split the default trainer's __init__ method so that users can override it more easily
12. Add safe_get to Config class
---------------------------- Another refactor from version 36 -------------------------
13. Rename all nlp transformers' preprocessors from TaskNamePreprocessor to TaskNameTransformersPreprocessor, for example:
TextClassificationPreprocessor -> TextClassificationTransformersPreprocessor
14. Add a base class per task for all nlp tasks' preprocessors that have at least two sub-preprocessors
15. Add output classes of nlp models
16. Refactor the logic for token-classification
17. Fix bug: checkpoint_hook does not support pytorch_model.pt
18. Fix bug: Pipeline name does not match the task name, so inference will not succeed after training
NOTE: This is only a stopgap; the root cause is the unclear relationship between models and pipelines
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10723513
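Item 12 above adds safe_get to the Config class. The idea of such a helper is to read a nested key without raising when an intermediate level is missing. The sketch below works on plain dicts with a dotted key chain; the signature and the dotted-key convention are assumptions for illustration, not the actual Config.safe_get.

```python
from collections.abc import Mapping
from typing import Any


def safe_get(cfg: Mapping, key_chain: str, default: Any = None) -> Any:
    """Follow a dotted key chain through nested mappings.

    Returns `default` as soon as a level is missing or is not a mapping.
    """
    node = cfg
    for key in key_chain.split('.'):
        if not isinstance(node, Mapping) or key not in node:
            return default
        node = node[key]
    return node


cfg = {'train': {'optimizer': {'lr': 2e-5}, 'max_epochs': 3}}
lr = safe_get(cfg, 'train.optimizer.lr')
scheduler_type = safe_get(cfg, 'train.lr_scheduler.type', 'linear')
```

Callers can then supply a default in one place instead of wrapping each nested lookup in its own existence check.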
* add save_pretrained to preprocessor
* save preprocessor config in hook
* refactor label-id mapping fetching logic
* test ok on sentence-similarity
* run on finetuning
* fix bug
* pre-commit passed
* fix bug
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/preprocessors/nlp/nlp_base.py
* add params to init
* 1. support max ckpt num 2. support ignoring everything but the bin file in continue training 3. add arguments to some nlp metrics
* Split trainer init impls to overridable methods
* remove some obsolete tokenizers
* unfinished
* support input params in pipeline
* fix bugs
* fix ut bug
* fix bug
* fix ut bug
* fix ut bug
* fix ut bug
* add base class for some preprocessors
* Merge commit '379867739548f394d0fa349ba07afe04adf4c8b6' into feat/refactor_config
* compatible with old code
* fix ut bug
* fix ut bugs
* fix bug
* add some comments
* fix ut bug
* add a requirement
* fix pre-commit
* Merge commit '0451b3d3cb2bebfef92ec2c227b2a3dd8d01dc6a' into feat/refactor_config
* fix bug
* Support function type in registry
* fix ut bug
* fix bug
* Merge commit '5f719e542b963f0d35457e5359df879a5eb80b82' into feat/refactor_config
# Conflicts:
# modelscope/pipelines/nlp/multilingual_word_segmentation_pipeline.py
# modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
# modelscope/pipelines/nlp/word_segmentation_pipeline.py
# modelscope/utils/hub.py
* remove obsolete file
* rename init args
* rename params
* fix merge bug
* add default preprocessor config for ner-model
* move a method to a util file
* remove unused config
* Fix a bug in pbar
* bestckptsaver: change default ckpt number to 1
* 1. Add assert to max_epoch 2. split init_dist and get_device 3. change cmp func name
* Fix bug
* fix bug
* fix bug
* unfinished refactoring
* unfinished
* uw
* uw
* uw
* uw
* Merge branch 'feat/refactor_config' into feat/refactor_trainer
# Conflicts:
# modelscope/preprocessors/nlp/document_segmentation_preprocessor.py
# modelscope/preprocessors/nlp/faq_question_answering_preprocessor.py
# modelscope/preprocessors/nlp/relation_extraction_preprocessor.py
# modelscope/preprocessors/nlp/text_generation_preprocessor.py
* uw
* uw
* unify nlp task outputs
* uw
* uw
* uw
* uw
* change the order of text cls pipeline
* refactor t5
* refactor tg task preprocessor
* fix
* unfinished
* temp
* refactor code
* unfinished
* unfinished
* unfinished
* unfinished
* uw
* Merge branch 'feat/refactor_config' into feat/refactor_trainer
* smoke test pass
* ut testing
* pre-commit passed
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/models/nlp/bert/document_segmentation.py
# modelscope/pipelines/nlp/__init__.py
# modelscope/pipelines/nlp/document_segmentation_pipeline.py
* merge master
* unfinished
* Merge branch 'feat/fix_bug_pipeline_name' into feat/refactor_config
* fix bug
* fix ut bug
* support ner batch inference
* fix ut bug
* fix bug
* support batch inference on three nlp tasks
* unfinished
* fix bug
* fix bug
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/models/base/base_model.py
# modelscope/pipelines/nlp/conversational_text_to_sql_pipeline.py
# modelscope/pipelines/nlp/dialog_intent_prediction_pipeline.py
# modelscope/pipelines/nlp/dialog_modeling_pipeline.py
# modelscope/pipelines/nlp/dialog_state_tracking_pipeline.py
# modelscope/pipelines/nlp/document_segmentation_pipeline.py
# modelscope/pipelines/nlp/faq_question_answering_pipeline.py
# modelscope/pipelines/nlp/feature_extraction_pipeline.py
# modelscope/pipelines/nlp/fill_mask_pipeline.py
# modelscope/pipelines/nlp/information_extraction_pipeline.py
# modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
# modelscope/pipelines/nlp/sentence_embedding_pipeline.py
# modelscope/pipelines/nlp/summarization_pipeline.py
# modelscope/pipelines/nlp/table_question_answering_pipeline.py
# modelscope/pipelines/nlp/text2text_generation_pipeline.py
# modelscope/pipelines/nlp/text_classification_pipeline.py
# modelscope/pipelines/nlp/text_error_correction_pipeline.py
# modelscope/pipelines/nlp/text_generation_pipeline.py
# modelscope/pipelines/nlp/text_ranking_pipeline.py
# modelscope/pipelines/nlp/token_classification_pipeline.py
# modelscope/pipelines/nlp/word_segmentation_pipeline.py
# modelscope/pipelines/nlp/zero_shot_classification_pipeline.py
# modelscope/trainers/nlp_trainer.py
* pre-commit passed
* fix bug
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/preprocessors/__init__.py
* fix bug
* fix bug
* fix bug
* fix bug
* fix bug
* fix bug
* pre-commit passed
* fix bug
* fix bug
* fix bug
* fix bug
* fix bug
* fix bug
* self review done
* fix bug
* fix bug
* fix bug
* fix bugs
* remove sub-token offset mapping
* fix name bug
* add some tests
* 1. support batch inference of text-generation,text2text-generation,token-classification,text-classification 2. add corresponding UTs
* add old logic back
* tmp save
* add tokenize by words logic back
* move outputs file back
* revert veco token-classification back
* fix typo
* Fix description
* Merge commit '4dd99b8f6e4e7aefe047c68a1bedd95d3ec596d6' into feat/refactor_config
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/pipelines/builder.py
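Several entries in the list above deal with limiting how many checkpoints are kept (the max ckpt num and bestckptsaver entries). A minimal sketch of one way to cap the number of saved files by deleting the oldest first; the class CheckpointRotator and its parameter name are hypothetical and do not reproduce the actual checkpoint hook:

```python
import os
import tempfile
from collections import deque


class CheckpointRotator:
    """Keep at most `max_checkpoint_num` checkpoint files on disk."""

    def __init__(self, max_checkpoint_num: int = 1):
        if max_checkpoint_num < 1:
            raise ValueError('max_checkpoint_num must be at least 1')
        self.max_checkpoint_num = max_checkpoint_num
        self._history = deque()  # paths in the order they were saved

    def register(self, path: str) -> None:
        """Record a newly saved checkpoint and remove the oldest surplus ones."""
        self._history.append(path)
        while len(self._history) > self.max_checkpoint_num:
            stale = self._history.popleft()
            if os.path.isfile(stale):
                os.remove(stale)


# Save three fake checkpoints while allowing only two to remain.
with tempfile.TemporaryDirectory() as work_dir:
    rotator = CheckpointRotator(max_checkpoint_num=2)
    for epoch in range(1, 4):
        ckpt = os.path.join(work_dir, f'epoch_{epoch}.pth')
        with open(ckpt, 'wb') as f:
            f.write(b'weights')
        rotator.register(ckpt)
    remaining = sorted(os.listdir(work_dir))
```

After the loop only the two most recent files are left in the working directory.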
1. Fix bugs in daily test
2. Fix a bug where the lr was updated before the optimizer's first update
TODO: this will still cause warnings when GA (gradient accumulation) is above 1
3. Remove the mode check in the text-classification preprocessor to fit the base trainer (bug fix)
Update some regression bins to fit the preprocessor
4. Update the regression tool to let calling code set atol and rtol
5. Add the default metric for text-classification task
6. Remove the unused ckpt conversion method in bert so that loading modeling_bert no longer requires tf
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10430764
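For the atol/rtol item above, a regression comparison with caller-supplied tolerances can be sketched as follows. The function all_close is hypothetical; it assumes the same convention as numpy.allclose, |actual - expected| <= atol + rtol * |expected|, which may not match the regression tool exactly.

```python
from typing import Sequence


def all_close(actual: Sequence[float],
              expected: Sequence[float],
              rtol: float = 1e-5,
              atol: float = 1e-8) -> bool:
    """Element-wise tolerance check over two equal-length sequences."""
    if len(actual) != len(expected):
        return False
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected))


# Default tolerances reject a 1% difference; a looser atol accepts it.
strict = all_close([1.0], [1.01])
loose = all_close([1.0], [1.01], rtol=0.0, atol=0.02)
```

Exposing both parameters lets a test loosen the check for numerically noisy models without changing the default for every other regression case.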