Commit Graph

24 Commits

Author SHA1 Message Date
Xingjun.Wang
210ab40c54 Upgrade datasets (#921)
* del _datasets_server import in hf_dataset_util

* fix streaming for youku-mplug and adopt latest datasets

* fix download config copy

* update ut

* add youku in test_general_datasets

* update UT for general dataset

* adapt to datasets version: 2.19.0 or later

* add assert for youku data UT

* fix disable_tqdm in some functions for 2.19.0 or later

* update get_module_with_script

* set trust_remote_code is True in load_dataset_with_ctx

* update print info

* update requirements for datasets version restriction

* fix _dataset_info

* add pillow

* update comments

* update comment

* reuse _download function in DataDownloadManager

* remove unused code

* update test_run_modelhub in Human3DAnimationTest

* set datasets>=2.18.0
2024-07-23 22:26:12 +08:00
xingjun.wxj
f2640a5a12 fix private dataset auth issue
1. Fix private datasets auth issue
2. Add arg: token (optional) in MsDataset.load() for FlexTrain
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/12721569
2023-05-24 19:48:20 +08:00
xingjun.wxj
e630621599 Virgo SDK supports odps data source
1. Support ODPS datasource for virgo sdk.
2. Adapt "inner_url" parser for single-modal and multi-modal datasets.

Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/12583695

* add virgo sdk odps data

* refine virgo sdk args and odps fetch batch data pipeline

* add inner import for odps

* del import VirgoDataset in MsDataset

* fix import VirgoDataset

* support inner url downloading

* refine dataset.py and maxcompute_utils.py

* add ut for virgo odps data batch

* reset unifold ut level as 1

* refine virgo batch data
2023-05-12 17:27:19 +08:00
jiangnana.jnn
46072898da remove easycv codes, plugin access
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11965727

* remove easycv codes

* fix custome msdatasets import and remove metainfo

* fix pipeline imports

* fix pre-check

* fix models import

* fix pre-check

* merge master
2023-05-09 17:58:01 +08:00
xingjun.wxj
e02a260c93 Refactor the task_datasets module
Refactor the task_datasets module:

1. Add new module modelscope.msdatasets.dataset_cls.custom_datasets.
2. Add new function: modelscope.msdatasets.ms_dataset.MsDataset.to_custom_dataset().
2. Add calling to_custom_dataset() func in MsDataset.load() to adapt new custom_datasets module.
3. Refactor the pipeline for loading custom dataset: 
	1) Only use MsDataset.load() function to load the custom datasets.
	2) Combine MsDataset.load() with class EpochBasedTrainer.
4. Add new entry func for building datasets in EpochBasedTrainer: see modelscope.trainers.trainer.EpochBasedTrainer.build_dataset()
5. Add new func to build the custom dataset from model configuration, see: modelscope.trainers.trainer.EpochBasedTrainer.build_dataset_from_cfg()
6. Add new registry function for building custom datasets, see: modelscope.msdatasets.dataset_cls.custom_datasets.builder.build_custom_dataset()
7. Refine the class SiameseUIETrainer to adapt the new custom_datasets module.
8. Add class TorchCustomDataset as a superclass for custom datasets classes.
9. To move modules/classes/functions:
	1) Move module msdatasets.audio to custom_datasets
	2) Move module msdatasets.cv to custom_datasets
	3) Move module bad_image_detecting to custom_datasets
	4) Move module damoyolo to custom_datasets
	5) Move module face_2d_keypoints to custom_datasets
	6) Move module hand_2d_keypoints to custom_datasets
	7) Move module human_wholebody_keypoint to custom_datasets
	8) Move module image_classification to custom_datasets
	9) Move module image_inpainting to custom_datasets
	10) Move module image_portrait_enhancement to custom_datasets
	11) Move module image_quality_assessment_degradation to custom_datasets
	12) Move module image_quality_assmessment_mos to custom_datasets
	13) Move class LanguageGuidedVideoSummarizationDataset to custom_datasets
	14) Move class MGeoRankingDataset to custom_datasets
	15) Move module movie_scene_segmentation custom_datasets
	16) Move module object_detection to custom_datasets
	17) Move module referring_video_object_segmentation to custom_datasets
	18) Move module sidd_image_denoising to custom_datasets
	19) Move module video_frame_interpolation to custom_datasets
	20) Move module video_stabilization to custom_datasets
	21) Move module video_super_resolution to custom_datasets
	22) Move class GoproImageDeblurringDataset to custom_datasets
	23) Move class EasyCVBaseDataset to custom_datasets
	24) Move class ImageInstanceSegmentationCocoDataset to custom_datasets
	25) Move class RedsImageDeblurringDataset to custom_datasets
	26) Move class TextRankingDataset to custom_datasets
	27) Move class VecoDataset to custom_datasets
	28) Move class VideoSummarizationDataset to custom_datasets
10. To delete modules/functions/classes:
	1) Del module task_datasets
	2) Del to_task_dataset() in EpochBasedTrainer
	3) Del build_dataset() in EpochBasedTrainer and renew a same name function.
11. Rename class Datasets to CustomDatasets in metainfo.py

Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11872747
2023-03-10 09:03:32 +08:00
xingjun.wxj
99e58c6ea3 [to #42322933] fix some feedback issues
1. Fix the conflict between local path and remote dataset name in the form of dataset_name='namespace/dataset_name' in MsDataset.load() function.
2. Modify the obj_key.startswith value in get_split_objects_map function to adapt to dir name 'xxx/' format.

Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11820290

* fix the conflict between local path and namespace/dataset_name of the dataset_name

* fix function: get_split_objects_map

* add UT for loading local csv file

* add new test case for test_load_local_csv function
2023-03-01 20:13:31 +08:00
jiangyu.xzy
9f1b767ecd asr训练dataset & 单独vad模型推理 & 多模型组合的asr推理
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11610526

* add asr dataset

* sv num_workers 0

* add vad pipeline

* add flexible vad/punc/lm model inputs
2023-02-10 05:09:00 +00:00
xingjun.wxj
f5745d869d [to #42322933] modify test_level of UT test_to_ms_dataset to avoid connection timeout issue 2023-02-05 12:20:19 +00:00
xingjun.wxj
196489a2c2 [to #42322933] fix test issues
1. support the form of '/to/path/abc.csv' in MsDataset.load() function
2. fix the compatibility issue of datasets
3. modify the resumable cache path for oss utils
4. add UT cases
2023-01-13 14:57:16 +00:00
xingjun.wxj
43edddd31f [to #42322933] msdataset module refactor and add 1230 features
1. 优化本地数据集加载链路  
2. local与remote解耦,无网络环境下也可以使用SDK  
3. 升级hf datasets及其相关依赖到最新版(2.7.0+)
4. 解决元数据感知不到数据文件变更的问题  
5. 系统分层设计
6. 本地缓存管理问题  
7. 优化error log输出信息  
8. 支持streaming load	
* a. 支持数据文件为zip格式的streaming
* b. 支持Image/Text/Audio/Biodata等格式数据集的iter
* c. 兼容训练数据在meta中的历史数据集的streaming load
* d. 支持数据文件为文件夹格式的streaming load

9. finetune任务串接进一步规范
* a. 避免出现to_hf_dataset这种使用,将常用的tf相关的func封装起来  
* b. 去掉了跟hf混用的一些逻辑,统一包装到MsDataset里面

10. 超大数据集场景优化
* a. list oss objects: 直接拉取meta中的csv mapping,不需要做 list_oss_objects的api调用(前述提交已实现)
* b. 优化sts过期加载问题(前述提交已实现)

11. 支持dataset_name格式为:namespace/dataset_name的输入方式

参考Aone链接: https://aone.alibaba-inc.com/v2/project/1162242/task/46262894
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11264406
2023-01-10 07:01:34 +08:00
yuze.zyz
bb5512d1ab [to #42322933] Refactor NLP and fix some user feedbacks
1. Abstract keys of dicts needed by nlp metric classes into the init method
2. Add Preprocessor.save_pretrained to save preprocessor information
3. Abstract the config saving function, which can lead to normally saving in the direct call of from_pretrained, and the modification of cfg one by one when training.
4. Remove SbertTokenizer and VecoTokenizer, use transformers' tokenizers instead
5. Use model/preprocessor's from_pretrained in all nlp pipeline classes.
6. Add model_kwargs and preprocessor_kwargs in all nlp pipeline classes
7. Add base classes for fill-mask and text-classification preprocessor, as a demo for later changes
8. Fix user feedback: Re-train the model in continue training scenario
9. Fix user feedback: Too many checkpoint saved
10. Simplify the nlp-trainer
11. Fix user feedback: Split the default trainer's __init__ method, which makes user easier to override
12. Add safe_get to Config class

----------------------------  Another refactor from version 36 -------------------------

13. Name all nlp transformers' preprocessors from TaskNamePreprocessor to TaskNameTransformersPreprocessor, for example:
      TextClassificationPreprocessor -> TextClassificationTransformersPreprocessor
14. Add a base class per task for all nlp tasks' preprocessors which has at least two sub-preprocessors
15. Add output classes of nlp models
16. Refactor the logic for token-classification
17. Fix bug: checkpoint_hook does not support pytorch_model.pt
18. Fix bug: Pipeline name does not match with task name, so inference will not succeed after training
       NOTE: This is just a stop bleeding solution, the root cause is the uncertainty of the relationship between models and pipelines
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10723513

    * add save_pretrained to preprocessor

* save preprocessor config in hook

* refactor label-id mapping fetching logic

* test ok on sentence-similarity

* run on finetuning

* fix bug

* pre-commit passed

* fix bug

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/preprocessors/nlp/nlp_base.py

* add params to init

* 1. support max ckpt num 2. support ignoring others but bin file in continue training 3. add arguments to some nlp metrics

* Split trainer init impls to overridable methods

* remove some obsolete tokenizers

* unfinished

* support input params in pipeline

* fix bugs

* fix ut bug

* fix bug

* fix ut bug

* fix ut bug

* fix ut bug

* add base class for some preprocessors

* Merge commit '379867739548f394d0fa349ba07afe04adf4c8b6' into feat/refactor_config

* compatible with old code

* fix ut bug

* fix ut bugs

* fix bug

* add some comments

* fix ut bug

* add a requirement

* fix pre-commit

* Merge commit '0451b3d3cb2bebfef92ec2c227b2a3dd8d01dc6a' into feat/refactor_config

* fixbug

* Support function type in registry

* fix ut bug

* fix bug

* Merge commit '5f719e542b963f0d35457e5359df879a5eb80b82' into feat/refactor_config

# Conflicts:
#	modelscope/pipelines/nlp/multilingual_word_segmentation_pipeline.py
#	modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
#	modelscope/pipelines/nlp/word_segmentation_pipeline.py
#	modelscope/utils/hub.py

* remove obsolete file

* rename init args

* rename params

* fix merge bug

* add default preprocessor config for ner-model

* move a method a util file

* remove unused config

* Fix a bug in pbar

* bestckptsaver:change default ckpt numbers to 1

* 1. Add assert to max_epoch 2. split init_dist and get_device 3. change cmp func name

* Fix bug

* fix bug

* fix bug

* unfinished refactoring

* unfinished

* uw

* uw

* uw

* uw

* Merge branch 'feat/refactor_config' into feat/refactor_trainer

# Conflicts:
#	modelscope/preprocessors/nlp/document_segmentation_preprocessor.py
#	modelscope/preprocessors/nlp/faq_question_answering_preprocessor.py
#	modelscope/preprocessors/nlp/relation_extraction_preprocessor.py
#	modelscope/preprocessors/nlp/text_generation_preprocessor.py

* uw

* uw

* unify nlp task outputs

* uw

* uw

* uw

* uw

* change the order of text cls pipeline

* refactor t5

* refactor tg task preprocessor

* fix

* unfinished

* temp

* refactor code

* unfinished

* unfinished

* unfinished

* unfinished

* uw

* Merge branch 'feat/refactor_config' into feat/refactor_trainer

* smoke test pass

* ut testing

* pre-commit passed

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/models/nlp/bert/document_segmentation.py
#	modelscope/pipelines/nlp/__init__.py
#	modelscope/pipelines/nlp/document_segmentation_pipeline.py

* merge master

* unifnished

* Merge branch 'feat/fix_bug_pipeline_name' into feat/refactor_config

* fix bug

* fix ut bug

* support ner batch inference

* fix ut bug

* fix bug

* support batch inference on three nlp tasks

* unfinished

* fix bug

* fix bug

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/models/base/base_model.py
#	modelscope/pipelines/nlp/conversational_text_to_sql_pipeline.py
#	modelscope/pipelines/nlp/dialog_intent_prediction_pipeline.py
#	modelscope/pipelines/nlp/dialog_modeling_pipeline.py
#	modelscope/pipelines/nlp/dialog_state_tracking_pipeline.py
#	modelscope/pipelines/nlp/document_segmentation_pipeline.py
#	modelscope/pipelines/nlp/faq_question_answering_pipeline.py
#	modelscope/pipelines/nlp/feature_extraction_pipeline.py
#	modelscope/pipelines/nlp/fill_mask_pipeline.py
#	modelscope/pipelines/nlp/information_extraction_pipeline.py
#	modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
#	modelscope/pipelines/nlp/sentence_embedding_pipeline.py
#	modelscope/pipelines/nlp/summarization_pipeline.py
#	modelscope/pipelines/nlp/table_question_answering_pipeline.py
#	modelscope/pipelines/nlp/text2text_generation_pipeline.py
#	modelscope/pipelines/nlp/text_classification_pipeline.py
#	modelscope/pipelines/nlp/text_error_correction_pipeline.py
#	modelscope/pipelines/nlp/text_generation_pipeline.py
#	modelscope/pipelines/nlp/text_ranking_pipeline.py
#	modelscope/pipelines/nlp/token_classification_pipeline.py
#	modelscope/pipelines/nlp/word_segmentation_pipeline.py
#	modelscope/pipelines/nlp/zero_shot_classification_pipeline.py
#	modelscope/trainers/nlp_trainer.py

* pre-commit passed

* fix bug

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/preprocessors/__init__.py

* fix bug

* fix bug

* fix bug

* fix bug

* fix bug

* fixbug

* pre-commit passed

* fix bug

* fixbug

* fix bug

* fix bug

* fix bug

* fix bug

* self review done

* fixbug

* fix bug

* fix bug

* fix bugs

* remove sub-token offset mapping

* fix name bug

* add some tests

* 1. support batch inference of text-generation,text2text-generation,token-classification,text-classification 2. add corresponding UTs

* add old logic back

* tmp save

* add tokenize by words logic back

* move outputs file back

* revert veco token-classification back

* fix typo

* Fix description

* Merge commit '4dd99b8f6e4e7aefe047c68a1bedd95d3ec596d6' into feat/refactor_config

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/pipelines/builder.py
2022-11-30 23:52:17 +08:00
yuze.zyz
707cbef013 [to #42322933]Fix bug in daily UT
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10491891
2022-10-22 23:25:18 +08:00
yuze.zyz
acba1786b0 [to #42322933] Fix bug in UT daily
1. Fix bugs in daily test
2. Fix a bug that the updating of lr is before the first time of updating of optimizer
    TODO this will still cause warnings when GA is above 1
3. Remove the judgement of mode in text-classification's preprocessor to fit the base trainer(Bug)
     Update some regression bins to fit the preprocessor
4. Update the regression tool to let outer code modify atol and rtol
5. Add the default metric for text-classification task
6. Remove the useless ckpt conversion method in bert to avoid the requirement of tf when loading modeling_bert
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10430764
2022-10-20 15:29:34 +08:00
zhangzhicheng.zzc
d721fabb34 [to #42322933]bert with sequence classification / token classification/ fill mask refactor
1.新增支持原始bert模型(非easynlp的 backbone prefix版本)
2.支持bert的在sequence classification/fill mask /token classification上的backbone head形式
3.统一了sequence classification几个任务的pipeline到一个类
4.fill mask 支持backbone head形式
5.token classification的几个子任务(ner,word seg, part of speech)的preprocessor 统一到了一起TokenClassificationPreprocessor
6. sequence classification的几个子任务(single classification, pair classification)的preprocessor 统一到了一起SequenceClassificationPreprocessor
7. 改动register中 cls的group_key 赋值位置,之前的group_key在多个decorators的情况下,会被覆盖,obj_cls的group_key信息不正确
8. 基于backbone head形式将 原本group_key和 module同名的情况尝试做调整,如下在modelscope/pipelines/nlp/sequence_classification_pipeline.py 中 
原本
 @PIPELINES.register_module(
    Tasks.sentiment_classification, module_name=Pipelines.sentiment_classification)
改成
@PIPELINES.register_module(
    Tasks.text_classification, module_name=Pipelines.sentiment_classification)
相应的configuration.json也有改动,这样的改动更符合任务和pipline(子任务)的关系。
8. 其他相应改动为支持上述功能
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10041463
2022-09-27 23:08:33 +08:00
wenmeng.zwm
6808e9a301 [to #44902099] add license for framework files
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10189613
2022-09-20 17:49:31 +08:00
shuying.shu
a9deb3895c [to #42322933] movie scene segmentation模型接入
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9872869
2022-08-31 20:54:20 +08:00
feiwu.yfw
2b64cf2bb6 [to #42322933]支持从dataset json文件中获取参数
* dataset json file add args
2022-08-30 15:15:15 +08:00
feiwu.yfw
39485426e7 [to #42322933]:fix msdataset
* 修复了zip文件不同打包模式下返回路径错误问题。
* 修复了替换了数据集文件重新下载时校验失败问题。
* 修复dataset oss文件在 REUSE 模式下重复下载的问题。
* 修复了csv数据集的meta json文件中某个split的meta和file字段都为''时加载所有split失败的问题。
 * 修复了不同版本datasets路径不一致的问题。
2022-08-26 22:41:13 +08:00
xingjun.wxj
44033290d4 [to #42322933]MsDataset 支持上传数据集压缩包和meta
1. MsDataset支持upload数据文件(压缩包)
2. MsDataset支持clone和upload meta data
3. 使用MsDataset.load()下载数据集,支持web端显示数据集下载计数
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9831232
2022-08-25 22:28:10 +08:00
feiwu.yfw
35548bd492 [to #43875101]
msdataset add coco dataset
unify taskdataset and ms dataset
fix hf datasets
2022-08-17 22:51:22 +08:00
feiwu.yfw
743e876981 [to #43660556] msdataset数据集加载
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9552632

* load csv dataset from modelscoop
2022-07-29 12:22:48 +08:00
feiwu.yfw
2c3875c0e1 [to #43299989] Fix msdataset
* fix msdataset
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9436292

    * fix msdataset
2022-07-20 16:38:15 +08:00
feiwu.yfw
5da470fd5d [to #42791465, #42779255, #42777959, #42757844, #42756050, #42746916, #42743595, #42791863] fix: fix msdataset
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9174075

* fix msdataset
2022-06-28 20:40:57 +08:00
yingda.chen
6702b29e21 [to #42794773]rename pydataset to msdataset
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9165402
2022-06-27 11:09:38 +08:00