modelscope

mirror of https://github.com/modelscope/modelscope.git synced 2025-12-19 01:29:24 +01:00

Author	SHA1	Message	Date
xingjun.wxj	99e58c6ea3	[to #42322933 ] fix some feedback issues 1. Fix the conflict between local path and remote dataset name in the form of dataset_name='namespace/dataset_name' in MsDataset.load() function. 2. Modify the obj_key.startswith value in get_split_objects_map function to adapt to dir name 'xxx/' format. Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11820290 * fix the conflict between local path and namespace/dataset_name of the dataset_name * fix function: get_split_objects_map * add UT for loading local csv file * add new test case for test_load_local_csv function	2023-03-01 20:13:31 +08:00
jiangyu.xzy	9f1b767ecd	asr训练dataset & 单独vad模型推理 & 多模型组合的asr推理 Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11610526 * add asr dataset * sv num_workers 0 * add vad pipeline * add flexible vad/punc/lm model inputs	2023-02-10 05:09:00 +00:00
xingjun.wxj	f5745d869d	[to #42322933 ] modify test_level of UT `test_to_ms_dataset` to avoid connection timeout issue	2023-02-05 12:20:19 +00:00
xingjun.wxj	196489a2c2	[to #42322933 ] fix test issues 1. support the form of '/to/path/abc.csv' in MsDataset.load() function 2. fix the compatibility issue of datasets 3. modify the resumable cache path for oss utils 4. add UT cases	2023-01-13 14:57:16 +00:00
xingjun.wxj	43edddd31f	[to #42322933 ] msdataset module refactor and add 1230 features 1. 优化本地数据集加载链路 2. local与remote解耦，无网络环境下也可以使用SDK 3. 升级hf datasets及其相关依赖到最新版(2.7.0+) 4. 解决元数据感知不到数据文件变更的问题 5. 系统分层设计 6. 本地缓存管理问题 7. 优化error log输出信息 8. 支持streaming load * a. 支持数据文件为zip格式的streaming * b. 支持Image/Text/Audio/Biodata等格式数据集的iter * c. 兼容训练数据在meta中的历史数据集的streaming load * d. 支持数据文件为文件夹格式的streaming load 9. finetune任务串接进一步规范 * a. 避免出现to_hf_dataset这种使用，将常用的tf相关的func封装起来 * b. 去掉了跟hf混用的一些逻辑，统一包装到MsDataset里面 10. 超大数据集场景优化 * a. list oss objects: 直接拉取meta中的csv mapping，不需要做 list_oss_objects的api调用（前述提交已实现） * b. 优化sts过期加载问题（前述提交已实现） 11. 支持dataset_name格式为：namespace/dataset_name的输入方式参考Aone链接： https://aone.alibaba-inc.com/v2/project/1162242/task/46262894 Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/11264406	2023-01-10 07:01:34 +08:00
yuze.zyz	bb5512d1ab	[to #42322933 ] Refactor NLP and fix some user feedbacks 1. Abstract keys of dicts needed by nlp metric classes into the init method 2. Add Preprocessor.save_pretrained to save preprocessor information 3. Abstract the config saving function, which can lead to normally saving in the direct call of from_pretrained, and the modification of cfg one by one when training. 4. Remove SbertTokenizer and VecoTokenizer, use transformers' tokenizers instead 5. Use model/preprocessor's from_pretrained in all nlp pipeline classes. 6. Add model_kwargs and preprocessor_kwargs in all nlp pipeline classes 7. Add base classes for fill-mask and text-classification preprocessor, as a demo for later changes 8. Fix user feedback: Re-train the model in continue training scenario 9. Fix user feedback: Too many checkpoint saved 10. Simplify the nlp-trainer 11. Fix user feedback: Split the default trainer's __init__ method, which makes user easier to override 12. Add safe_get to Config class ---------------------------- Another refactor from version 36 ------------------------- 13. Name all nlp transformers' preprocessors from TaskNamePreprocessor to TaskNameTransformersPreprocessor, for example: TextClassificationPreprocessor -> TextClassificationTransformersPreprocessor 14. Add a base class per task for all nlp tasks' preprocessors which has at least two sub-preprocessors 15. Add output classes of nlp models 16. Refactor the logic for token-classification 17. Fix bug: checkpoint_hook does not support pytorch_model.pt 18. Fix bug: Pipeline name does not match with task name, so inference will not succeed after training NOTE: This is just a stop bleeding solution, the root cause is the uncertainty of the relationship between models and pipelines Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10723513 * add save_pretrained to preprocessor * save preprocessor config in hook * refactor label-id mapping fetching logic * test ok on sentence-similarity * run on finetuning * fix bug * pre-commit passed * fix bug * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/preprocessors/nlp/nlp_base.py * add params to init * 1. support max ckpt num 2. support ignoring others but bin file in continue training 3. add arguments to some nlp metrics * Split trainer init impls to overridable methods * remove some obsolete tokenizers * unfinished * support input params in pipeline * fix bugs * fix ut bug * fix bug * fix ut bug * fix ut bug * fix ut bug * add base class for some preprocessors * Merge commit '379867739548f394d0fa349ba07afe04adf4c8b6' into feat/refactor_config * compatible with old code * fix ut bug * fix ut bugs * fix bug * add some comments * fix ut bug * add a requirement * fix pre-commit * Merge commit '0451b3d3cb2bebfef92ec2c227b2a3dd8d01dc6a' into feat/refactor_config * fixbug * Support function type in registry * fix ut bug * fix bug * Merge commit '5f719e542b963f0d35457e5359df879a5eb80b82' into feat/refactor_config # Conflicts: # modelscope/pipelines/nlp/multilingual_word_segmentation_pipeline.py # modelscope/pipelines/nlp/named_entity_recognition_pipeline.py # modelscope/pipelines/nlp/word_segmentation_pipeline.py # modelscope/utils/hub.py * remove obsolete file * rename init args * rename params * fix merge bug * add default preprocessor config for ner-model * move a method a util file * remove unused config * Fix a bug in pbar * bestckptsaver:change default ckpt numbers to 1 * 1. Add assert to max_epoch 2. split init_dist and get_device 3. change cmp func name * Fix bug * fix bug * fix bug * unfinished refactoring * unfinished * uw * uw * uw * uw * Merge branch 'feat/refactor_config' into feat/refactor_trainer # Conflicts: # modelscope/preprocessors/nlp/document_segmentation_preprocessor.py # modelscope/preprocessors/nlp/faq_question_answering_preprocessor.py # modelscope/preprocessors/nlp/relation_extraction_preprocessor.py # modelscope/preprocessors/nlp/text_generation_preprocessor.py * uw * uw * unify nlp task outputs * uw * uw * uw * uw * change the order of text cls pipeline * refactor t5 * refactor tg task preprocessor * fix * unfinished * temp * refactor code * unfinished * unfinished * unfinished * unfinished * uw * Merge branch 'feat/refactor_config' into feat/refactor_trainer * smoke test pass * ut testing * pre-commit passed * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/models/nlp/bert/document_segmentation.py # modelscope/pipelines/nlp/__init__.py # modelscope/pipelines/nlp/document_segmentation_pipeline.py * merge master * unifnished * Merge branch 'feat/fix_bug_pipeline_name' into feat/refactor_config * fix bug * fix ut bug * support ner batch inference * fix ut bug * fix bug * support batch inference on three nlp tasks * unfinished * fix bug * fix bug * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/models/base/base_model.py # modelscope/pipelines/nlp/conversational_text_to_sql_pipeline.py # modelscope/pipelines/nlp/dialog_intent_prediction_pipeline.py # modelscope/pipelines/nlp/dialog_modeling_pipeline.py # modelscope/pipelines/nlp/dialog_state_tracking_pipeline.py # modelscope/pipelines/nlp/document_segmentation_pipeline.py # modelscope/pipelines/nlp/faq_question_answering_pipeline.py # modelscope/pipelines/nlp/feature_extraction_pipeline.py # modelscope/pipelines/nlp/fill_mask_pipeline.py # modelscope/pipelines/nlp/information_extraction_pipeline.py # modelscope/pipelines/nlp/named_entity_recognition_pipeline.py # modelscope/pipelines/nlp/sentence_embedding_pipeline.py # modelscope/pipelines/nlp/summarization_pipeline.py # modelscope/pipelines/nlp/table_question_answering_pipeline.py # modelscope/pipelines/nlp/text2text_generation_pipeline.py # modelscope/pipelines/nlp/text_classification_pipeline.py # modelscope/pipelines/nlp/text_error_correction_pipeline.py # modelscope/pipelines/nlp/text_generation_pipeline.py # modelscope/pipelines/nlp/text_ranking_pipeline.py # modelscope/pipelines/nlp/token_classification_pipeline.py # modelscope/pipelines/nlp/word_segmentation_pipeline.py # modelscope/pipelines/nlp/zero_shot_classification_pipeline.py # modelscope/trainers/nlp_trainer.py * pre-commit passed * fix bug * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/preprocessors/__init__.py * fix bug * fix bug * fix bug * fix bug * fix bug * fixbug * pre-commit passed * fix bug * fixbug * fix bug * fix bug * fix bug * fix bug * self review done * fixbug * fix bug * fix bug * fix bugs * remove sub-token offset mapping * fix name bug * add some tests * 1. support batch inference of text-generation,text2text-generation,token-classification,text-classification 2. add corresponding UTs * add old logic back * tmp save * add tokenize by words logic back * move outputs file back * revert veco token-classification back * fix typo * Fix description * Merge commit '4dd99b8f6e4e7aefe047c68a1bedd95d3ec596d6' into feat/refactor_config * Merge branch 'master' into feat/refactor_config # Conflicts: # modelscope/pipelines/builder.py	2022-11-30 23:52:17 +08:00
yuze.zyz	707cbef013	[to #42322933 ]Fix bug in daily UT Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10491891	2022-10-22 23:25:18 +08:00
yuze.zyz	acba1786b0	[to #42322933 ] Fix bug in UT daily 1. Fix bugs in daily test 2. Fix a bug that the updating of lr is before the first time of updating of optimizer TODO this will still cause warnings when GA is above 1 3. Remove the judgement of mode in text-classification's preprocessor to fit the base trainer(Bug) Update some regression bins to fit the preprocessor 4. Update the regression tool to let outer code modify atol and rtol 5. Add the default metric for text-classification task 6. Remove the useless ckpt conversion method in bert to avoid the requirement of tf when loading modeling_bert Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10430764	2022-10-20 15:29:34 +08:00
zhangzhicheng.zzc	d721fabb34	[to #42322933 ]bert with sequence classification / token classification/ fill mask refactor 1.新增支持原始bert模型（非easynlp的 backbone prefix版本） 2.支持bert的在sequence classification/fill mask /token classification上的backbone head形式 3.统一了sequence classification几个任务的pipeline到一个类 4.fill mask 支持backbone head形式 5.token classification的几个子任务（ner，word seg， part of speech）的preprocessor 统一到了一起TokenClassificationPreprocessor 6. sequence classification的几个子任务（single classification， pair classification）的preprocessor 统一到了一起SequenceClassificationPreprocessor 7. 改动register中 cls的group_key 赋值位置，之前的group_key在多个decorators的情况下，会被覆盖，obj_cls的group_key信息不正确 8. 基于backbone head形式将原本group_key和 module同名的情况尝试做调整，如下在modelscope/pipelines/nlp/sequence_classification_pipeline.py 中原本 @PIPELINES.register_module( Tasks.sentiment_classification, module_name=Pipelines.sentiment_classification) 改成 @PIPELINES.register_module( Tasks.text_classification, module_name=Pipelines.sentiment_classification) 相应的configuration.json也有改动，这样的改动更符合任务和pipline（子任务）的关系。 8. 其他相应改动为支持上述功能 Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10041463	2022-09-27 23:08:33 +08:00
wenmeng.zwm	6808e9a301	[to #44902099 ] add license for framework files Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10189613	2022-09-20 17:49:31 +08:00
shuying.shu	a9deb3895c	[to #42322933 ] movie scene segmentation模型接入 Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9872869	2022-08-31 20:54:20 +08:00
feiwu.yfw	2b64cf2bb6	[to #42322933 ]支持从dataset json文件中获取参数 * dataset json file add args	2022-08-30 15:15:15 +08:00
feiwu.yfw	39485426e7	[to #42322933 ]:fix msdataset * 修复了zip文件不同打包模式下返回路径错误问题。 * 修复了替换了数据集文件重新下载时校验失败问题。 * 修复dataset oss文件在 REUSE 模式下重复下载的问题。 * 修复了csv数据集的meta json文件中某个split的meta和file字段都为''时加载所有split失败的问题。 * 修复了不同版本datasets路径不一致的问题。	2022-08-26 22:41:13 +08:00
xingjun.wxj	44033290d4	[to #42322933 ]MsDataset 支持上传数据集压缩包和meta 1. MsDataset支持upload数据文件(压缩包) 2. MsDataset支持clone和upload meta data 3. 使用MsDataset.load()下载数据集，支持web端显示数据集下载计数 Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9831232	2022-08-25 22:28:10 +08:00
feiwu.yfw	35548bd492	[to #43875101 ] msdataset add coco dataset unify taskdataset and ms dataset fix hf datasets	2022-08-17 22:51:22 +08:00
feiwu.yfw	743e876981	[to #43660556 ] msdataset数据集加载 Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9552632 * load csv dataset from modelscoop	2022-07-29 12:22:48 +08:00
feiwu.yfw	2c3875c0e1	[to #43299989 ] Fix msdataset * fix msdataset Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9436292 * fix msdataset	2022-07-20 16:38:15 +08:00
feiwu.yfw	5da470fd5d	[to #42791465 , #42779255 , #42777959 , #42757844 , #42756050 , #42746916 , #42743595 , #42791863 ] fix: fix msdataset Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9174075 * fix msdataset	2022-06-28 20:40:57 +08:00
yingda.chen	6702b29e21	[to #42794773 ]rename pydataset to msdataset Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9165402	2022-06-27 11:09:38 +08:00

19 Commits