modelscope

mirror of https://github.com/modelscope/modelscope.git synced 2025-12-16 16:27:45 +01:00

Author	SHA1	Message	Date
Xingjun.Wang	210ab40c54	Upgrade datasets (#921 ) * del _datasets_server import in hf_dataset_util * fix streaming for youku-mplug and adopt latest datasets * fix download config copy * update ut * add youku in test_general_datasets * update UT for general dataset * adapt to datasets version: 2.19.0 or later * add assert for youku data UT * fix disable_tqdm in some functions for 2.19.0 or later * update get_module_with_script * set trust_remote_code is True in load_dataset_with_ctx * update print info * update requirements for datasets version restriction * fix _dataset_info * add pillow * update comments * update comment * reuse _download function in DataDownloadManager * remove unused code * update test_run_modelhub in Human3DAnimationTest * set datasets>=2.18.0	2024-07-23 22:26:12 +08:00
xingjun.wxj	f2640a5a12	fix private dataset auth issue 1. Fix private datasets auth issue 2. Add arg: token (optional) in MsDataset.load() for FlexTrain Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/12721569	2023-05-24 19:48:20 +08:00
xingjun.wxj	e630621599	Virgo SDK supports odps data source 1. Support ODPS datasource for virgo sdk. 2. Adapt "inner_url" parser for single-modal and multi-modal datasets. Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/12583695 * add virgo sdk odps data * refine virgo sdk args and odps fetch batch data pipeline * add inner import for odps * del import VirgoDataset in MsDataset * fix import VirgoDataset * support inner url downloading * refine dataset.py and maxcompute_utils.py * add ut for virgo odps data batch * reset unifold ut level as 1 * refine virgo batch data	2023-05-12 17:27:19 +08:00
jiangnana.jnn	46072898da	remove easycv codes, plugin access Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11965727 * remove easycv codes * fix custome msdatasets import and remove metainfo * fix pipeline imports * fix pre-check * fix models import * fix pre-check * merge master	2023-05-09 17:58:01 +08:00
xingjun.wxj	e02a260c93	Refactor the task_datasets module Refactor the task_datasets module: 1. Add new module modelscope.msdatasets.dataset_cls.custom_datasets. 2. Add new function: modelscope.msdatasets.ms_dataset.MsDataset.to_custom_dataset(). 2. Add calling to_custom_dataset() func in MsDataset.load() to adapt new custom_datasets module. 3. Refactor the pipeline for loading custom dataset: 1) Only use MsDataset.load() function to load the custom datasets. 2) Combine MsDataset.load() with class EpochBasedTrainer. 4. Add new entry func for building datasets in EpochBasedTrainer: see modelscope.trainers.trainer.EpochBasedTrainer.build_dataset() 5. Add new func to build the custom dataset from model configuration, see: modelscope.trainers.trainer.EpochBasedTrainer.build_dataset_from_cfg() 6. Add new registry function for building custom datasets, see: modelscope.msdatasets.dataset_cls.custom_datasets.builder.build_custom_dataset() 7. Refine the class SiameseUIETrainer to adapt the new custom_datasets module. 8. Add class TorchCustomDataset as a superclass for custom datasets classes. 9. 
To move modules/classes/functions: 1) Move module msdatasets.audio to custom_datasets 2) Move module msdatasets.cv to custom_datasets 3) Move module bad_image_detecting to custom_datasets 4) Move module damoyolo to custom_datasets 5) Move module face_2d_keypoints to custom_datasets 6) Move module hand_2d_keypoints to custom_datasets 7) Move module human_wholebody_keypoint to custom_datasets 8) Move module image_classification to custom_datasets 9) Move module image_inpainting to custom_datasets 10) Move module image_portrait_enhancement to custom_datasets 11) Move module image_quality_assessment_degradation to custom_datasets 12) Move module image_quality_assmessment_mos to custom_datasets 13) Move class LanguageGuidedVideoSummarizationDataset to custom_datasets 14) Move class MGeoRankingDataset to custom_datasets 15) Move module movie_scene_segmentation custom_datasets 16) Move module object_detection to custom_datasets 17) Move module referring_video_object_segmentation to custom_datasets 18) Move module sidd_image_denoising to custom_datasets 19) Move module video_frame_interpolation to custom_datasets 20) Move module video_stabilization to custom_datasets 21) Move module video_super_resolution to custom_datasets 22) Move class GoproImageDeblurringDataset to custom_datasets 23) Move class EasyCVBaseDataset to custom_datasets 24) Move class ImageInstanceSegmentationCocoDataset to custom_datasets 25) Move class RedsImageDeblurringDataset to custom_datasets 26) Move class TextRankingDataset to custom_datasets 27) Move class VecoDataset to custom_datasets 28) Move class VideoSummarizationDataset to custom_datasets 10. To delete modules/functions/classes: 1) Del module task_datasets 2) Del to_task_dataset() in EpochBasedTrainer 3) Del build_dataset() in EpochBasedTrainer and renew a same name function. 11. Rename class Datasets to CustomDatasets in metainfo.py Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11872747	2023-03-10 09:03:32 +08:00
xingjun.wxj	99e58c6ea3	[to #42322933 ] fix some feedback issues 1. Fix the conflict between local path and remote dataset name in the form of dataset_name='namespace/dataset_name' in MsDataset.load() function. 2. Modify the obj_key.startswith value in get_split_objects_map function to adapt to dir name 'xxx/' format. Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11820290 * fix the conflict between local path and namespace/dataset_name of the dataset_name * fix function: get_split_objects_map * add UT for loading local csv file * add new test case for test_load_local_csv function	2023-03-01 20:13:31 +08:00
jiangyu.xzy	9f1b767ecd	ASR training dataset & standalone VAD model inference & multi-model combined ASR inference Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11610526 * add asr dataset * sv num_workers 0 * add vad pipeline * add flexible vad/punc/lm model inputs	2023-02-10 05:09:00 +00:00
xingjun.wxj	f5745d869d	[to #42322933 ] modify test_level of UT `test_to_ms_dataset` to avoid connection timeout issue	2023-02-05 12:20:19 +00:00
xingjun.wxj	196489a2c2	[to #42322933 ] fix test issues 1. support the form of '/to/path/abc.csv' in MsDataset.load() function 2. fix the compatibility issue of datasets 3. modify the resumable cache path for oss utils 4. add UT cases	2023-01-13 14:57:16 +00:00
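xingjun.wxj	43edddd31f	[to #42322933 ] msdataset module refactor and add 1230 features 1. Optimize the local dataset loading pipeline 2. Decouple local and remote so the SDK can also be used without network access 3. Upgrade hf datasets and its related dependencies to the latest version (2.7.0+) 4. Fix the issue where metadata does not detect changes to data files 5. Layered system design 6. Local cache management issues 7. Improve error log output 8. Support streaming load * a. Support streaming for data files in zip format * b. Support iterating over datasets in formats such as Image/Text/Audio/Biodata * c. Stay compatible with streaming load for legacy datasets whose training data is in the meta * d. Support streaming load for data files in folder format 9. Further standardize how finetune tasks are chained * a. Avoid usages such as to_hf_dataset; wrap the commonly used tf-related funcs * b. Remove some logic that was mixed with hf and wrap it uniformly in MsDataset 10. Optimizations for very large datasets * a. list oss objects: fetch the csv mapping in the meta directly, with no list_oss_objects API call needed (implemented in an earlier commit) * b. Improve handling of STS expiry during loading (implemented in an earlier commit) 11. Support dataset_name input in the form namespace/dataset_name. See the Aone link: https://aone.alibaba-inc.com/v2/project/1162242/task/46262894 Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11264406	2023-01-10 07:01:34 +08:00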
yuze.zyz	bb5512d1ab	[to #42322933 ] Refactor NLP and fix some user feedbacks 1. Abstract keys of dicts needed by nlp metric classes into the init method 2. Add Preprocessor.save_pretrained to save preprocessor information 3. Abstract the config saving function, which can lead to normally saving in the direct call of from_pretrained, and the modification of cfg one by one when training. 4. Remove SbertTokenizer and VecoTokenizer, use transformers' tokenizers instead 5. Use model/preprocessor's from_pretrained in all nlp pipeline classes. 6. Add model_kwargs and preprocessor_kwargs in all nlp pipeline classes 7. Add base classes for fill-mask and text-classification preprocessor, as a demo for later changes 8. Fix user feedback: Re-train the model in continue training scenario 9. Fix user feedback: Too many checkpoint saved 10. Simplify the nlp-trainer 11. Fix user feedback: Split the default trainer's __init__ method, which makes user easier to override 12. Add safe_get to Config class ---------------------------- Another refactor from version 36 ------------------------- 13. Name all nlp transformers' preprocessors from TaskNamePreprocessor to TaskNameTransformersPreprocessor, for example: TextClassificationPreprocessor -> TextClassificationTransformersPreprocessor 14. Add a base class per task for all nlp tasks' preprocessors which has at least two sub-preprocessors 15. Add output classes of nlp models 16. Refactor the logic for token-classification 17. Fix bug: checkpoint_hook does not support pytorch_model.pt 18. 
Fix bug: Pipeline name does not match with task name, so inference will not succeed after training NOTE: This is just a stop bleeding solution, the root cause is the uncertainty of the relationship between models and pipelines Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10723513 * add save_pretrained to preprocessor * save preprocessor config in hook * refactor label-id mapping fetching logic * test ok on sentence-similarity * run on finetuning * fix bug * pre-commit passed * fix bug * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/preprocessors/nlp/nlp_base.py * add params to init * 1. support max ckpt num 2. support ignoring others but bin file in continue training 3. add arguments to some nlp metrics * Split trainer init impls to overridable methods * remove some obsolete tokenizers * unfinished * support input params in pipeline * fix bugs * fix ut bug * fix bug * fix ut bug * fix ut bug * fix ut bug * add base class for some preprocessors * Merge commit '379867739548f394d0fa349ba07afe04adf4c8b6' into feat/refactor_config * compatible with old code * fix ut bug * fix ut bugs * fix bug * add some comments * fix ut bug * add a requirement * fix pre-commit * Merge commit '0451b3d3cb2bebfef92ec2c227b2a3dd8d01dc6a' into feat/refactor_config * fixbug * Support function type in registry * fix ut bug * fix bug * Merge commit '5f719e542b963f0d35457e5359df879a5eb80b82' into feat/refactor_config # Conflicts: # modelscope/pipelines/nlp/multilingual_word_segmentation_pipeline.py # modelscope/pipelines/nlp/named_entity_recognition_pipeline.py # modelscope/pipelines/nlp/word_segmentation_pipeline.py # modelscope/utils/hub.py * remove obsolete file * rename init args * rename params * fix merge bug * add default preprocessor config for ner-model * move a method a util file * remove unused config * Fix a bug in pbar * bestckptsaver:change default ckpt numbers to 1 * 1. Add assert to max_epoch 2. split init_dist and get_device 3. 
change cmp func name * Fix bug * fix bug * fix bug * unfinished refactoring * unfinished * uw * uw * uw * uw * Merge branch 'feat/refactor_config' into feat/refactor_trainer # Conflicts: # modelscope/preprocessors/nlp/document_segmentation_preprocessor.py # modelscope/preprocessors/nlp/faq_question_answering_preprocessor.py # modelscope/preprocessors/nlp/relation_extraction_preprocessor.py # modelscope/preprocessors/nlp/text_generation_preprocessor.py * uw * uw * unify nlp task outputs * uw * uw * uw * uw * change the order of text cls pipeline * refactor t5 * refactor tg task preprocessor * fix * unfinished * temp * refactor code * unfinished * unfinished * unfinished * unfinished * uw * Merge branch 'feat/refactor_config' into feat/refactor_trainer * smoke test pass * ut testing * pre-commit passed * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/models/nlp/bert/document_segmentation.py # modelscope/pipelines/nlp/__init__.py # modelscope/pipelines/nlp/document_segmentation_pipeline.py * merge master * unifnished * Merge branch 'feat/fix_bug_pipeline_name' into feat/refactor_config * fix bug * fix ut bug * support ner batch inference * fix ut bug * fix bug * support batch inference on three nlp tasks * unfinished * fix bug * fix bug * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/models/base/base_model.py # modelscope/pipelines/nlp/conversational_text_to_sql_pipeline.py # modelscope/pipelines/nlp/dialog_intent_prediction_pipeline.py # modelscope/pipelines/nlp/dialog_modeling_pipeline.py # modelscope/pipelines/nlp/dialog_state_tracking_pipeline.py # modelscope/pipelines/nlp/document_segmentation_pipeline.py # modelscope/pipelines/nlp/faq_question_answering_pipeline.py # modelscope/pipelines/nlp/feature_extraction_pipeline.py # modelscope/pipelines/nlp/fill_mask_pipeline.py # modelscope/pipelines/nlp/information_extraction_pipeline.py # modelscope/pipelines/nlp/named_entity_recognition_pipeline.py # 
modelscope/pipelines/nlp/sentence_embedding_pipeline.py # modelscope/pipelines/nlp/summarization_pipeline.py # modelscope/pipelines/nlp/table_question_answering_pipeline.py # modelscope/pipelines/nlp/text2text_generation_pipeline.py # modelscope/pipelines/nlp/text_classification_pipeline.py # modelscope/pipelines/nlp/text_error_correction_pipeline.py # modelscope/pipelines/nlp/text_generation_pipeline.py # modelscope/pipelines/nlp/text_ranking_pipeline.py # modelscope/pipelines/nlp/token_classification_pipeline.py # modelscope/pipelines/nlp/word_segmentation_pipeline.py # modelscope/pipelines/nlp/zero_shot_classification_pipeline.py # modelscope/trainers/nlp_trainer.py * pre-commit passed * fix bug * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/preprocessors/__init__.py * fix bug * fix bug * fix bug * fix bug * fix bug * fixbug * pre-commit passed * fix bug * fixbug * fix bug * fix bug * fix bug * fix bug * self review done * fixbug * fix bug * fix bug * fix bugs * remove sub-token offset mapping * fix name bug * add some tests * 1. support batch inference of text-generation,text2text-generation,token-classification,text-classification 2. add corresponding UTs * add old logic back * tmp save * add tokenize by words logic back * move outputs file back * revert veco token-classification back * fix typo * Fix description * Merge commit '4dd99b8f6e4e7aefe047c68a1bedd95d3ec596d6' into feat/refactor_config * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/pipelines/builder.py	2022-11-30 23:52:17 +08:00
yuze.zyz	707cbef013	[to #42322933 ]Fix bug in daily UT Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10491891	2022-10-22 23:25:18 +08:00
yuze.zyz	acba1786b0	[to #42322933 ] Fix bug in UT daily 1. Fix bugs in daily test 2. Fix a bug that the updating of lr is before the first time of updating of optimizer TODO this will still cause warnings when GA is above 1 3. Remove the judgement of mode in text-classification's preprocessor to fit the base trainer(Bug) Update some regression bins to fit the preprocessor 4. Update the regression tool to let outer code modify atol and rtol 5. Add the default metric for text-classification task 6. Remove the useless ckpt conversion method in bert to avoid the requirement of tf when loading modeling_bert Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10430764	2022-10-20 15:29:34 +08:00
zhangzhicheng.zzc	d721fabb34	[to #42322933 ]bert with sequence classification / token classification/ fill mask refactor 1. Add support for the original bert model (the non-easynlp backbone prefix version) 2. Support the backbone-head form of bert for sequence classification / fill mask / token classification 3. Unify the pipelines of the several sequence classification tasks into one class 4. fill mask supports the backbone-head form 5. Unify the preprocessors of the token classification subtasks (ner, word seg, part of speech) into TokenClassificationPreprocessor 6. Unify the preprocessors of the sequence classification subtasks (single classification, pair classification) into SequenceClassificationPreprocessor 7. Change where the cls group_key is assigned in register; previously, with multiple decorators, group_key was overwritten and the group_key information of obj_cls was incorrect 8. Based on the backbone-head form, try adjusting the cases where group_key and module shared the same name. For example, in modelscope/pipelines/nlp/sequence_classification_pipeline.py the original @PIPELINES.register_module( Tasks.sentiment_classification, module_name=Pipelines.sentiment_classification) is changed to @PIPELINES.register_module( Tasks.text_classification, module_name=Pipelines.sentiment_classification). The corresponding configuration.json is changed as well. This better matches the relationship between a task and its pipelines (subtasks). 8. Other corresponding changes to support the features above Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10041463	2022-09-27 23:08:33 +08:00
wenmeng.zwm	6808e9a301	[to #44902099 ] add license for framework files Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10189613	2022-09-20 17:49:31 +08:00
shuying.shu	a9deb3895c	[to #42322933 ] integrate the movie scene segmentation model Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9872869	2022-08-31 20:54:20 +08:00
feiwu.yfw	2b64cf2bb6	[to #42322933 ] support reading args from the dataset json file * dataset json file add args	2022-08-30 15:15:15 +08:00
feiwu.yfw	39485426e7	[to #42322933 ]:fix msdataset * Fixed wrong returned paths for zip files packed in different modes. * Fixed verification failure when re-downloading after dataset files were replaced. * Fixed repeated downloads of dataset oss files in REUSE mode. * Fixed failure to load all splits when both the meta and file fields of a split in a csv dataset's meta json file are ''. * Fixed inconsistent paths across different versions of datasets.	2022-08-26 22:41:13 +08:00
xingjun.wxj	44033290d4	[to #42322933 ]MsDataset supports uploading dataset archives and meta 1. MsDataset supports uploading data files (archives) 2. MsDataset supports cloning and uploading meta data 3. Downloading a dataset with MsDataset.load() now supports showing the dataset download count on the web side Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9831232	2022-08-25 22:28:10 +08:00
feiwu.yfw	35548bd492	[to #43875101 ] msdataset add coco dataset unify taskdataset and ms dataset fix hf datasets	2022-08-17 22:51:22 +08:00
feiwu.yfw	743e876981	[to #43660556 ] msdataset dataset loading Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9552632 * load csv dataset from modelscoop	2022-07-29 12:22:48 +08:00
feiwu.yfw	2c3875c0e1	[to #43299989 ] Fix msdataset * fix msdataset Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9436292 * fix msdataset	2022-07-20 16:38:15 +08:00
feiwu.yfw	5da470fd5d	[to #42791465 , #42779255 , #42777959 , #42757844 , #42756050 , #42746916 , #42743595 , #42791863 ] fix: fix msdataset Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9174075 * fix msdataset	2022-06-28 20:40:57 +08:00
yingda.chen	6702b29e21	[to #42794773 ]rename pydataset to msdataset Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9165402	2022-06-27 11:09:38 +08:00

24 Commits