modelscope/examples/pytorch/text_classification/finetune_text_classification.py
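
# Fine-tune a text classification model with ModelScope, driven entirely by
# command-line arguments. Datasets come either from the ModelScope hub
# (MsDataset) or from a local dataset description json file.
#
# Example invocation (a sketch: flag names follow the TrainingArgs /
# TextClassificationArguments fields below; the model and dataset ids are
# placeholders, not tested values):
#
#   python finetune_text_classification.py \
#       --model <model_id> \
#       --train_dataset_name <dataset_name> --train_split train \
#       --val_dataset_name <dataset_name> --val_split validation \
#       --first_sequence sentence --label label --labels 0,1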

import os
from dataclasses import dataclass, field

from modelscope import (EpochBasedTrainer, MsDataset, TrainingArgs,
                        build_dataset_from_file)
from modelscope.trainers import build_trainer
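

# Turn a comma-separated label string (e.g. 'positive,negative') into the
# label -> id mapping stored at preprocessor.label2id.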
def set_labels(labels):
    if isinstance(labels, str):
        labels = labels.split(',')
    return {label: id for id, label in enumerate(labels)}
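

# Text-classification specific flags on top of the generic TrainingArgs.
# Each field's 'cfg_node' metadata maps the CLI value onto a node of the
# trainer configuration; 'cfg_setter' post-processes the raw value first.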
@dataclass(init=False)
class TextClassificationArguments(TrainingArgs):

    first_sequence: str = field(
        default=None,
        metadata={
            'help': 'The first sequence key of preprocessor',
            'cfg_node': 'preprocessor.first_sequence'
        })

    second_sequence: str = field(
        default=None,
        metadata={
            'help': 'The second sequence key of preprocessor',
            'cfg_node': 'preprocessor.second_sequence'
        })

    label: str = field(
        default=None,
        metadata={
            'help': 'The label key of preprocessor',
            'cfg_node': 'preprocessor.label'
        })

    labels: str = field(
        default=None,
        metadata={
            'help': 'The labels of the dataset',
            'cfg_node': 'preprocessor.label2id',
            'cfg_setter': set_labels,
        })

    preprocessor: str = field(
        default=None,
        metadata={
            'help': 'The preprocessor type',
            'cfg_node': 'preprocessor.type'
        })
config, args = TextClassificationArguments().parse_cli().to_config()
print(config, args)
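
# Callback the trainer uses to adjust the loaded configuration before training:
# either merge the CLI-built config into the model's own config
# (--use_model_config) or replace it outright, set num_labels from the label
# mapping, and, for a LinearLR scheduler, derive total_iters from dataset size,
# batch size and epoch count. train_dataset is defined below at module level
# and is available by the time the trainer calls this function.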
def cfg_modify_fn(cfg):
    if args.use_model_config:
        cfg.merge_from_dict(config)
    else:
        cfg = config
    cfg.model['num_labels'] = len(cfg.preprocessor.label2id)
    if cfg.train.lr_scheduler.type == 'LinearLR':
        cfg.train.lr_scheduler['total_iters'] = \
            int(len(train_dataset) / cfg.train.dataloader.batch_size_per_gpu) * cfg.train.max_epochs
    return cfg
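
# Load the train/validation splits, either from the ModelScope hub by name or,
# when --dataset_json_file is given, from a local dataset description file.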
if args.dataset_json_file is None:
    train_dataset = MsDataset.load(
        args.train_dataset_name,
        subset_name=args.train_subset_name,
        split=args.train_split,
        namespace=args.train_dataset_namespace)
    validation_dataset = MsDataset.load(
        args.val_dataset_name,
        subset_name=args.val_subset_name,
        split=args.val_split,
        namespace=args.val_dataset_namespace)
else:
    train_dataset, validation_dataset = build_dataset_from_file(
        args.dataset_json_file)
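
# Assemble the trainer arguments; cfg_modify_fn is applied when the trainer
# builds its configuration.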
kwargs = dict(
    model=args.model,
    model_revision=args.model_revision,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    seed=args.seed,
    cfg_modify_fn=cfg_modify_fn)
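
# Expose the local rank through the environment so distributed runs pick the
# correct device, then build the default epoch-based trainer and start training.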
os.environ['LOCAL_RANK'] = str(args.local_rank)
trainer: EpochBasedTrainer = build_trainer(name='trainer', default_args=kwargs)
trainer.train()