[to #42322933] nlp preprocessor refactor
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9269314

* init
* token to ids
* add model
* model forward ready
* add intent
* intent preprocessor ready
* intent success
* merge master
* test with model hub
* add flake8
* update
* update
* update
* Merge branch 'master' into nlp/space/gen
* delete file about gen
* init
* fix flake8 bug
* [to #42322933] init
* bug fix
* [to #42322933] init
* update pipeline registry info
* Merge remote-tracking branch 'origin/master' into feat/nli
* [to #42322933] init
* [to #42322933] init
* modify forward
* [to #42322933] init
* generation ready
* init
* Merge branch 'master' into feat/zero_shot_classification
  # Conflicts:
  #   modelscope/preprocessors/__init__.py
* [to #42322933] bugfix
* [to #42322933] pre-commit fix
* fill mask
* register multiple models on model and pipeline
* add tests
* test level >= 0
* local gen ready
* merge with master
* dialog modeling ready
* fix comments: rename and refactor AliceMindMLM; adjust pipeline
* space intent and modeling (generation) are ready
* bug fix
* add dep
* add dep
* support dst data processor
* merge with nlp/space/dst
* merge with master
* Merge remote-tracking branch 'origin' into feat/fill_mask
  Conflicts: modelscope/models/nlp/__init__.py, modelscope/pipelines/builder.py, modelscope/pipelines/outputs.py, modelscope/preprocessors/nlp.py, requirements/nlp.txt
* merge with master
* merge with master 2/2
* fix comments
* fix isort for pre-commit check
* allow params to pass to pipeline's __call__ method
* Merge remote-tracking branch 'origin/master' into feat/zero_shot_classification
* merge with nli task
* merge with sentiment_classification
* merge with zero_shot_classification
* merge with fill_mask
* merge with space
* merge with master head
* Merge remote-tracking branch 'origin' into feat/fill_mask
  Conflicts: modelscope/utils/constant.py
* fix: pipeline module_name from model_type to 'fill_mask' & fix merge bug
* unfinished change
* fix bug
* unfinished
* unfinished
* revise modelhub dependency
* Merge branch 'feat/nlp_refactor' of http://gitlab.alibaba-inc.com/Ali-MaaS/MaaS-lib into feat/nlp_refactor
* add eval() to pipeline call
* add test level
* ut run passed
* add default args
* tmp
* merge master
* all ut passed
* remove a useless enum
* revert a mistaken modification
* revert a mistaken modification
* Merge commit 'ace8af92465f7d772f035aebe98967726655f12c' into feat/nlp
  * commit 'ace8af92465f7d772f035aebe98967726655f12c':
    [to #42322933] Add cv-action-recognition-pipeline to maas lib
    [to #42463204] support PIL.Image for image_captioning_pipeline
    [to #42670107] restore pydataset test
    [to #42322933] add create-if-not-exist and add (back) create model example
    Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9130661
    [to #41474818] fix: fix errors in task name definition
  # Conflicts:
  #   modelscope/pipelines/builder.py
  #   modelscope/utils/constant.py
* Merge branch 'feat/nlp' into feat/nlp_refactor
  * feat/nlp:
    [to #42322933] Add cv-action-recognition-pipeline to maas lib
    [to #42463204] support PIL.Image for image_captioning_pipeline
    [to #42670107] restore pydataset test
    [to #42322933] add create-if-not-exist and add (back) create model example
    Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9130661
    [to #41474818] fix: fix errors in task name definition
  # Conflicts:
  #   modelscope/pipelines/builder.py
* fix compile bug
* refactor space
* Merge branch 'feat/nlp_refactor' of http://gitlab.alibaba-inc.com/Ali-MaaS/MaaS-lib into feat/nlp_refactor
* Merge remote-tracking branch 'origin' into feat/fill_mask
* fix
* pre-commit lint
* lint file
* lint file
* lint file
* update modelhub dependency
* lint file
* ignore dst_processor temporarily
* solve comment: 1. change MaskedLMModelBase to MaskedLanguageModelBase 2. remove a useless import
* recommit
* remove MaskedLanguageModel from __all__
* Merge commit '1a0d4af55a2eee69d89633874890f50eda8f8700' into feat/nlp_refactor
  * commit '1a0d4af55a2eee69d89633874890f50eda8f8700':
    [to #42322933] test level check
    Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9143809
    [to #42322933] update nlp models name in metainfo
    Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9134657
  # Conflicts:
  #   modelscope/metainfo.py
* update
* revert pipeline params update
* remove zeroshot
* update sequence classification outputs
* merge with fill mask
* Merge remote-tracking branch 'origin' into feat/fill_mask
* fix
* fix flake8 warning of dst
* Merge remote-tracking branch 'origin/feat/fill_mask' into feat/nlp
* merge with master
* remove useless test.py
* Merge remote-tracking branch 'origin/master' into feat/nlp
* remove unformatted space trainer
* revise based on comments, except Chinese comments
* skip ci blocking
* translation pipeline
* csanmt model for translation pipeline
* update
* update
* update builder.py
* change Chinese notes of space3.0 into English
* translate Chinese comments to English
* add space to metainfo
* update csanmt_translation
* update csanmt transformer
* merge with master
* update csanmt translation
* update lint
* update metainfo.py
* Update translation_pipeline.py
* Update builder.py
* fix: 1. make csanmt derived from Model 2. add kwargs to prevent call errors
* pre-commit check
* temporarily exclude flake8
* temporarily ignore translation files
* fix bug
* pre-commit passed
* fix bug
* fix bug
* revert pre-commit ignore
* pre-commit passed
* fix bug
* merge with master
* add missing setting
* merge with master
* add outputs
* modify test level
* modify Chinese comment
* remove useless doc
* space outputs normalization
* Merge remote-tracking branch 'origin/master' into nlp/translation
* update translation_pipeline.py
* Merge remote-tracking branch 'origin/master' into feat/nlp
* Merge remote-tracking branch 'origin/master' into nlp/translation
* add new __init__ method
* add new __init__ method
* update output format
* Merge remote-tracking branch 'origin/master' into feat/nlp
* update output format
* merge with master
* merge with nlp/translate
* update the translation comment
* update the translation comment
* Merge branch 'nlp/translation' into feat/nlp
* Merge remote-tracking branch 'origin/master' into feat/nlp
* Merge remote-tracking branch 'origin/master' into feat/nlp
* nlp preprocessor refactor
* add get_model_type in util.hub
* update the default preprocessor args
* update the fill mask preprocessor
* bug typo fixed
Committed by: wenmeng.zwm
Parent: a84c80cd57
Commit: df42427ae0
@@ -37,9 +37,9 @@ The pipeline function supports passing in instantiated preprocessor and model objects, so that
 1. First, create the preprocessor and the model
 ```python
 from modelscope.models import Model
-from modelscope.preprocessors import TokenClassifcationPreprocessor
+from modelscope.preprocessors import TokenClassificationPreprocessor
 model = Model.from_pretrained('damo/nlp_structbert_word-segmentation_chinese-base')
-tokenizer = TokenClassifcationPreprocessor(model.model_dir)
+tokenizer = TokenClassificationPreprocessor(model.model_dir)
 ```
 2. Create the pipeline using the tokenizer and model objects
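The README hunk ends just before the step-2 code. For orientation, a sketch of what step 2 amounts to, using the same model id as step 1 and the task name and call style that appear in the test diffs further down; the sample sentence is invented:

```python
from modelscope.models import Model
from modelscope.pipelines import pipeline
from modelscope.preprocessors import TokenClassificationPreprocessor
from modelscope.utils.constant import Tasks

model = Model.from_pretrained('damo/nlp_structbert_word-segmentation_chinese-base')
tokenizer = TokenClassificationPreprocessor(model.model_dir)

# Pass the pre-built model and preprocessor straight into the pipeline,
# as test_run_with_model_from_modelhub does in the test diff below.
word_segmentation = pipeline(
    task=Tasks.word_segmentation, model=model, preprocessor=tokenizer)
print(word_segmentation(input='今天天气不错,适合出去游玩'))
```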
@@ -106,6 +106,7 @@ class Preprocessors(object):
     load_image = 'load-image'

     # nlp preprocessor
+    sen_sim_tokenizer = 'sen-sim-tokenizer'
     bert_seq_cls_tokenizer = 'bert-seq-cls-tokenizer'
     palm_text_gen_tokenizer = 'palm-text-gen-tokenizer'
     token_cls_tokenizer = 'token-cls-tokenizer'

@@ -6,7 +6,7 @@ import torch
 from ...metainfo import Pipelines
 from ...models import Model
 from ...models.nlp import SbertForSentenceSimilarity
-from ...preprocessors import SequenceClassificationPreprocessor
+from ...preprocessors import SentenceSimilarityPreprocessor
 from ...utils.constant import Tasks
 from ..base import Input, Pipeline
 from ..builder import PIPELINES
@@ -21,7 +21,7 @@ class SentenceSimilarityPipeline(Pipeline):

     def __init__(self,
                  model: Union[Model, str],
-                 preprocessor: SequenceClassificationPreprocessor = None,
+                 preprocessor: SentenceSimilarityPreprocessor = None,
                  first_sequence='first_sequence',
                  second_sequence='second_sequence',
                  **kwargs):
@@ -29,7 +29,7 @@ class SentenceSimilarityPipeline(Pipeline):

         Args:
             model (SbertForSentenceSimilarity): a model instance
-            preprocessor (SequenceClassificationPreprocessor): a preprocessor instance
+            preprocessor (SentenceSimilarityPreprocessor): a preprocessor instance
         """
         assert isinstance(model, str) or isinstance(model, SbertForSentenceSimilarity), \
             'model must be a single str or SbertForSentenceSimilarity'
@@ -37,7 +37,7 @@ class SentenceSimilarityPipeline(Pipeline):
             model,
             SbertForSentenceSimilarity) else Model.from_pretrained(model)
         if preprocessor is None:
-            preprocessor = SequenceClassificationPreprocessor(
+            preprocessor = SentenceSimilarityPreprocessor(
                 sc_model.model_dir,
                 first_sequence=first_sequence,
                 second_sequence=second_sequence)
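A rough usage sketch of the renamed preprocessor with this pipeline, following the pattern in the test_sentence_similarity.py diff near the end of this commit; the model id and the two input sentences are placeholders:

```python
from modelscope.models import Model
from modelscope.pipelines import pipeline
from modelscope.preprocessors import SentenceSimilarityPreprocessor
from modelscope.utils.constant import Tasks

# Placeholder model id; the tests use self.model_id instead.
model = Model.from_pretrained('damo/nlp_structbert_sentence-similarity_chinese-base')
tokenizer = SentenceSimilarityPreprocessor(
    model.model_dir,
    first_sequence='first_sequence',
    second_sequence='second_sequence')

pipeline_ins = pipeline(
    task=Tasks.sentence_similarity, model=model, preprocessor=tokenizer)
# The refactored preprocessor accepts a (sentence1, sentence2) tuple directly.
print(pipeline_ins(input=('今天天气不错', '今天天气很好')))
```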
@@ -22,11 +22,11 @@ class TextGenerationPipeline(Pipeline):
                  model: Union[PalmForTextGeneration, str],
                  preprocessor: Optional[TextGenerationPreprocessor] = None,
                  **kwargs):
-        """use `model` and `preprocessor` to create a nlp text classification pipeline for prediction
+        """use `model` and `preprocessor` to create a nlp text generation pipeline for prediction

         Args:
-            model (SequenceClassificationModel): a model instance
-            preprocessor (SequenceClassificationPreprocessor): a preprocessor instance
+            model (PalmForTextGeneration): a model instance
+            preprocessor (TextGenerationPreprocessor): a preprocessor instance
         """
         model = model if isinstance(
             model, PalmForTextGeneration) else Model.from_pretrained(model)
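For reference, creating this pipeline from a bare model id would look roughly as follows; the model id is a placeholder and `Tasks.text_generation` is assumed to be the registered task name. The string case is handled by exactly the `isinstance` branch shown above:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Placeholder model id: any PALM text-generation checkpoint on the hub.
text_generator = pipeline(
    task=Tasks.text_generation,
    model='damo/nlp_palm2.0_text-generation_chinese-base')
print(text_generator('今天天气不错'))
```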
@@ -5,7 +5,7 @@ import torch
 from ...metainfo import Pipelines
 from ...models import Model
 from ...models.nlp import SbertForTokenClassification
-from ...preprocessors import TokenClassifcationPreprocessor
+from ...preprocessors import TokenClassificationPreprocessor
 from ...utils.constant import Tasks
 from ..base import Pipeline, Tensor
 from ..builder import PIPELINES
@@ -18,21 +18,22 @@ __all__ = ['WordSegmentationPipeline']
     Tasks.word_segmentation, module_name=Pipelines.word_segmentation)
 class WordSegmentationPipeline(Pipeline):

-    def __init__(self,
-                 model: Union[SbertForTokenClassification, str],
-                 preprocessor: Optional[TokenClassifcationPreprocessor] = None,
-                 **kwargs):
+    def __init__(
+            self,
+            model: Union[SbertForTokenClassification, str],
+            preprocessor: Optional[TokenClassificationPreprocessor] = None,
+            **kwargs):
         """use `model` and `preprocessor` to create a nlp word segmentation pipeline for prediction

         Args:
             model (StructBertForTokenClassification): a model instance
-            preprocessor (TokenClassifcationPreprocessor): a preprocessor instance
+            preprocessor (TokenClassificationPreprocessor): a preprocessor instance
         """
         model = model if isinstance(
             model,
             SbertForTokenClassification) else Model.from_pretrained(model)
         if preprocessor is None:
-            preprocessor = TokenClassifcationPreprocessor(model.model_dir)
+            preprocessor = TokenClassificationPreprocessor(model.model_dir)
         model.eval()
         super().__init__(model=model, preprocessor=preprocessor, **kwargs)
         self.tokenizer = preprocessor.tokenizer
@@ -5,7 +5,8 @@ from typing import Any, Dict, Union

 from transformers import AutoTokenizer

-from ..metainfo import Models, Preprocessors
+from ..metainfo import Preprocessors
+from ..models import Model
 from ..utils.constant import Fields, InputFields
 from ..utils.type_assert import type_assert
 from .base import Preprocessor
@@ -13,9 +14,10 @@ from .builder import PREPROCESSORS

 __all__ = [
     'Tokenize', 'SequenceClassificationPreprocessor',
-    'TextGenerationPreprocessor', 'TokenClassifcationPreprocessor',
+    'TextGenerationPreprocessor', 'TokenClassificationPreprocessor',
     'NLIPreprocessor', 'SentimentClassificationPreprocessor',
-    'FillMaskPreprocessor', 'ZeroShotClassificationPreprocessor'
+    'FillMaskPreprocessor', 'SentenceSimilarityPreprocessor',
+    'ZeroShotClassificationPreprocessor'
 ]

@@ -33,9 +35,7 @@ class Tokenize(Preprocessor):
         return data


-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.nli_tokenizer)
-class NLIPreprocessor(Preprocessor):
+class NLPPreprocessorBase(Preprocessor):

     def __init__(self, model_dir: str, *args, **kwargs):
         """preprocess the data via the vocab.txt from the `model_dir` path
@@ -45,18 +45,19 @@ class NLIPreprocessor(Preprocessor):
         """

         super().__init__(*args, **kwargs)

-        from sofa import SbertTokenizer
         self.model_dir: str = model_dir
         self.first_sequence: str = kwargs.pop('first_sequence',
                                               'first_sequence')
         self.second_sequence = kwargs.pop('second_sequence', 'second_sequence')
-        self.sequence_length = kwargs.pop('sequence_length', 128)
+        self.tokenize_kwargs = kwargs
+        self.tokenizer = self.build_tokenizer(model_dir)

-        self.tokenizer = SbertTokenizer.from_pretrained(self.model_dir)
+    def build_tokenizer(self, model_dir):
+        from sofa import SbertTokenizer
+        return SbertTokenizer.from_pretrained(model_dir)

-    @type_assert(object, tuple)
-    def __call__(self, data: tuple) -> Dict[str, Any]:
+    @type_assert(object, object)
+    def __call__(self, data: Union[str, tuple, Dict]) -> Dict[str, Any]:
         """process the raw input data

         Args:
@@ -70,101 +71,54 @@ class NLIPreprocessor(Preprocessor):
         Returns:
             Dict[str, Any]: the preprocessed data
         """
-        sentence1, sentence2 = data
-        new_data = {
-            self.first_sequence: sentence1,
-            self.second_sequence: sentence2
-        }
-        # preprocess the data for the model input
-
-        rst = {
-            'id': [],
-            'input_ids': [],
-            'attention_mask': [],
-            'token_type_ids': []
-        }
-
-        max_seq_length = self.sequence_length
-
-        text_a = new_data[self.first_sequence]
-        text_b = new_data[self.second_sequence]
-        feature = self.tokenizer(
-            text_a,
-            text_b,
-            padding=False,
-            truncation=True,
-            max_length=max_seq_length)
-
-        rst['id'].append(new_data.get('id', str(uuid.uuid4())))
-        rst['input_ids'].append(feature['input_ids'])
-        rst['attention_mask'].append(feature['attention_mask'])
-        rst['token_type_ids'].append(feature['token_type_ids'])
-
-        return rst
+        text_a, text_b = None, None
+        if isinstance(data, str):
+            text_a = data
+        elif isinstance(data, tuple):
+            assert len(data) == 2
+            text_a, text_b = data
+        elif isinstance(data, dict):
+            text_a = data.get(self.first_sequence)
+            text_b = data.get(self.second_sequence, None)
+
+        return self.tokenizer(text_a, text_b, **self.tokenize_kwargs)
+
+
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.nli_tokenizer)
+class NLIPreprocessor(NLPPreprocessorBase):
+
+    def __init__(self, model_dir: str, *args, **kwargs):
+        kwargs['truncation'] = True
+        kwargs['padding'] = False
+        kwargs['return_tensors'] = 'pt'
+        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
+        super().__init__(model_dir, *args, **kwargs)


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.sen_cls_tokenizer)
-class SentimentClassificationPreprocessor(Preprocessor):
+class SentimentClassificationPreprocessor(NLPPreprocessorBase):

     def __init__(self, model_dir: str, *args, **kwargs):
-        """preprocess the data via the vocab.txt from the `model_dir` path
-
-        Args:
-            model_dir (str): model path
-        """
-
-        super().__init__(*args, **kwargs)
-
-        from sofa import SbertTokenizer
-        self.model_dir: str = model_dir
-        self.first_sequence: str = kwargs.pop('first_sequence',
-                                              'first_sequence')
-        self.second_sequence = kwargs.pop('second_sequence', 'second_sequence')
-        self.sequence_length = kwargs.pop('sequence_length', 128)
-
-        self.tokenizer = SbertTokenizer.from_pretrained(self.model_dir)
-
-    @type_assert(object, str)
-    def __call__(self, data: str) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (str): a sentence
-                Example:
-                    'you are so handsome.'
-        Returns:
-            Dict[str, Any]: the preprocessed data
-        """
-
-        new_data = {self.first_sequence: data}
-        # preprocess the data for the model input
-
-        rst = {
-            'id': [],
-            'input_ids': [],
-            'attention_mask': [],
-            'token_type_ids': []
-        }
-
-        max_seq_length = self.sequence_length
-
-        text_a = new_data[self.first_sequence]
-        text_b = new_data.get(self.second_sequence, None)
-        feature = self.tokenizer(
-            text_a,
-            text_b,
-            padding='max_length',
-            truncation=True,
-            max_length=max_seq_length)
-
-        rst['id'].append(new_data.get('id', str(uuid.uuid4())))
-        rst['input_ids'].append(feature['input_ids'])
-        rst['attention_mask'].append(feature['attention_mask'])
-        rst['token_type_ids'].append(feature['token_type_ids'])
-
-        return rst
+        kwargs['truncation'] = True
+        kwargs['padding'] = 'max_length'
+        kwargs['return_tensors'] = 'pt'
+        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
+        super().__init__(model_dir, *args, **kwargs)
+
+
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.sen_sim_tokenizer)
+class SentenceSimilarityPreprocessor(NLPPreprocessorBase):
+
+    def __init__(self, model_dir: str, *args, **kwargs):
+        kwargs['truncation'] = True
+        kwargs['padding'] = False
+        kwargs['return_tensors'] = 'pt'
+        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
+        super().__init__(model_dir, *args, **kwargs)


 @PREPROCESSORS.register_module(
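The pattern introduced here is a small template method: the base class normalizes the three accepted input shapes and forwards all remaining kwargs to the tokenizer, while each subclass only pins the tokenizer arguments for its task and, where needed, overrides `build_tokenizer`. A condensed, self-contained sketch of that shape, with transformers' `AutoTokenizer` standing in for sofa's `SbertTokenizer`:

```python
from typing import Any, Dict, Union

from transformers import AutoTokenizer  # stand-in for sofa's SbertTokenizer


class NLPPreprocessorBase:
    """Minimal re-creation of the refactored base class, for illustration."""

    def __init__(self, model_dir: str, **kwargs):
        self.model_dir = model_dir
        self.first_sequence = kwargs.pop('first_sequence', 'first_sequence')
        self.second_sequence = kwargs.pop('second_sequence', 'second_sequence')
        # Everything left in kwargs is forwarded verbatim to the tokenizer call.
        self.tokenize_kwargs = kwargs
        self.tokenizer = self.build_tokenizer(model_dir)

    def build_tokenizer(self, model_dir: str):
        # Subclasses override this hook to pick a different tokenizer.
        return AutoTokenizer.from_pretrained(model_dir)

    def __call__(self, data: Union[str, tuple, Dict]) -> Dict[str, Any]:
        # Normalize the three accepted input shapes into (text_a, text_b).
        text_a, text_b = None, None
        if isinstance(data, str):
            text_a = data
        elif isinstance(data, tuple):
            assert len(data) == 2
            text_a, text_b = data
        elif isinstance(data, dict):
            text_a = data.get(self.first_sequence)
            text_b = data.get(self.second_sequence, None)
        return self.tokenizer(text_a, text_b, **self.tokenize_kwargs)


class SentenceSimilarityPreprocessor(NLPPreprocessorBase):

    def __init__(self, model_dir: str, **kwargs):
        # A subclass now only pins the tokenizer arguments it cares about.
        kwargs['truncation'] = True
        kwargs['padding'] = False
        kwargs['return_tensors'] = 'pt'
        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
        super().__init__(model_dir, **kwargs)
```

Because `tokenize_kwargs` is whatever is left in `kwargs`, a subclass that sets `return_tensors='pt'` gets tensors back directly, without the per-class `rst` bookkeeping the old `__call__` bodies carried.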
@@ -192,36 +146,7 @@ class SequenceClassificationPreprocessor(Preprocessor):

     @type_assert(object, (str, tuple, Dict))
     def __call__(self, data: Union[str, tuple, Dict]) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (str or tuple, Dict):
-                sentence1 (str): a sentence
-                    Example:
-                        'you are so handsome.'
-                or
-                (sentence1, sentence2)
-                    sentence1 (str): a sentence
-                        Example:
-                            'you are so handsome.'
-                    sentence2 (str): a sentence
-                        Example:
-                            'you are so beautiful.'
-                or
-                {field1: field_value1, field2: field_value2}
-                    field1 (str): field name, default 'first_sequence'
-                    field_value1 (str): a sentence
-                        Example:
-                            'you are so handsome.'
-
-                    field2 (str): field name, default 'second_sequence'
-                    field_value2 (str): a sentence
-                        Example:
-                            'you are so beautiful.'
-
-        Returns:
-            Dict[str, Any]: the preprocessed data
-        """
+        feature = super().__call__(data)
         if isinstance(data, str):
             new_data = {self.first_sequence: data}
         elif isinstance(data, tuple):
@@ -263,136 +188,55 @@ class SequenceClassificationPreprocessor(Preprocessor):

 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.palm_text_gen_tokenizer)
-class TextGenerationPreprocessor(Preprocessor):
+class TextGenerationPreprocessor(NLPPreprocessorBase):

     def __init__(self, model_dir: str, tokenizer, *args, **kwargs):
         """preprocess the data using the vocab.txt from the `model_dir` path

         Args:
             model_dir (str): model path
         """
-        super().__init__(*args, **kwargs)
-
-        self.model_dir: str = model_dir
-        self.first_sequence: str = kwargs.pop('first_sequence',
-                                              'first_sequence')
-        self.second_sequence: str = kwargs.pop('second_sequence',
-                                               'second_sequence')
-        self.sequence_length: int = kwargs.pop('sequence_length', 128)
         self.tokenizer = tokenizer
+        kwargs['truncation'] = True
+        kwargs['padding'] = 'max_length'
+        kwargs['return_tensors'] = 'pt'
+        kwargs['return_token_type_ids'] = False
+        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
+        super().__init__(model_dir, *args, **kwargs)

-    @type_assert(object, str)
-    def __call__(self, data: str) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (str): a sentence
-                Example:
-                    'you are so handsome.'
-
-        Returns:
-            Dict[str, Any]: the preprocessed data
-        """
-        import torch
-
-        new_data = {self.first_sequence: data}
-        # preprocess the data for the model input
-
-        rst = {'input_ids': [], 'attention_mask': []}
-
-        max_seq_length = self.sequence_length
-
-        text_a = new_data.get(self.first_sequence, None)
-        text_b = new_data.get(self.second_sequence, None)
-        feature = self.tokenizer(
-            text_a,
-            text_b,
-            padding='max_length',
-            truncation=True,
-            max_length=max_seq_length)
-
-        rst['input_ids'].append(feature['input_ids'])
-        rst['attention_mask'].append(feature['attention_mask'])
-        return {k: torch.tensor(v) for k, v in rst.items()}
+    def build_tokenizer(self, model_dir):
+        return self.tokenizer


 @PREPROCESSORS.register_module(Fields.nlp)
-class FillMaskPreprocessor(Preprocessor):
+class FillMaskPreprocessor(NLPPreprocessorBase):

     def __init__(self, model_dir: str, *args, **kwargs):
-        """preprocess the data via the vocab.txt from the `model_dir` path
-
-        Args:
-            model_dir (str): model path
-        """
-        super().__init__(*args, **kwargs)
-        self.model_dir = model_dir
-        self.first_sequence: str = kwargs.pop('first_sequence',
-                                              'first_sequence')
-        self.sequence_length = kwargs.pop('sequence_length', 128)
-        try:
-            from transformers import AutoTokenizer
-            self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
-        except KeyError:
-            from sofa.utils.backend import AutoTokenizer
-            self.tokenizer = AutoTokenizer.from_pretrained(
-                model_dir, use_fast=False)
-
-    @type_assert(object, str)
-    def __call__(self, data: str) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (str): a sentence
-                Example:
-                    'you are so handsome.'
-
-        Returns:
-            Dict[str, Any]: the preprocessed data
-        """
-        import torch
-
-        new_data = {self.first_sequence: data}
-        # preprocess the data for the model input
-
-        rst = {'input_ids': [], 'attention_mask': [], 'token_type_ids': []}
-
-        max_seq_length = self.sequence_length
-
-        text_a = new_data[self.first_sequence]
-        feature = self.tokenizer(
-            text_a,
-            padding='max_length',
-            truncation=True,
-            max_length=max_seq_length,
-            return_token_type_ids=True)
-
-        rst['input_ids'].append(feature['input_ids'])
-        rst['attention_mask'].append(feature['attention_mask'])
-        rst['token_type_ids'].append(feature['token_type_ids'])
-
-        return {k: torch.tensor(v) for k, v in rst.items()}
+        kwargs['truncation'] = True
+        kwargs['padding'] = 'max_length'
+        kwargs['return_tensors'] = 'pt'
+        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
+        kwargs['return_token_type_ids'] = True
+        super().__init__(model_dir, *args, **kwargs)
+
+    def build_tokenizer(self, model_dir):
+        from ..utils.hub import get_model_type
+        model_type = get_model_type(model_dir)
+        if model_type in ['sbert', 'structbert', 'bert']:
+            from sofa import SbertTokenizer
+            return SbertTokenizer.from_pretrained(model_dir, use_fast=False)
+        elif model_type == 'veco':
+            from sofa import VecoTokenizer
+            return VecoTokenizer.from_pretrained(model_dir, use_fast=False)
+        else:
+            # TODO Only support veco & sbert
+            raise RuntimeError(f'Unsupported model type: {model_type}')


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.token_cls_tokenizer)
-class TokenClassifcationPreprocessor(Preprocessor):
+class TokenClassificationPreprocessor(NLPPreprocessorBase):

     def __init__(self, model_dir: str, *args, **kwargs):
         """preprocess the data via the vocab.txt from the `model_dir` path

         Args:
             model_dir (str): model path
         """
-
-        super().__init__(*args, **kwargs)
-
-        from sofa import SbertTokenizer
-        self.model_dir: str = model_dir
-        self.tokenizer = SbertTokenizer.from_pretrained(self.model_dir)
+        super().__init__(model_dir, *args, **kwargs)

     @type_assert(object, str)
-    def __call__(self, data: str) -> Dict[str, Any]:
+    def __call__(self, data: Union[str, Dict]) -> Dict[str, Any]:
         """process the raw input data

         Args:
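With `build_tokenizer` resolving the tokenizer from the on-disk config, fill-mask preprocessing reduces to the sketch below; the local path is hypothetical, and the printed shape assumes the `padding='max_length'` / `max_length=128` defaults set in `__init__` above:

```python
from modelscope.preprocessors import FillMaskPreprocessor
from modelscope.utils.hub import get_model_type

# Hypothetical local checkout of a StructBERT fill-mask model.
model_dir = '/path/to/nlp_structbert_fill-mask_chinese-large'

print(get_model_type(model_dir))  # e.g. 'structbert' -> SbertTokenizer branch
preprocessor = FillMaskPreprocessor(model_dir)
inputs = preprocessor('生活的真谛是[MASK]。')
print(inputs['input_ids'].shape)  # torch.Size([1, 128]) with the defaults above
```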
@@ -405,7 +249,8 @@ class TokenClassifcationPreprocessor(Preprocessor):
         """

         # preprocess the data for the model input
-
+        if isinstance(data, dict):
+            data = data[self.first_sequence]
         text = data.replace(' ', '').strip()
         tokens = []
         for token in text:
@@ -425,7 +270,7 @@ class TokenClassifcationPreprocessor(Preprocessor):

 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.zero_shot_cls_tokenizer)
-class ZeroShotClassificationPreprocessor(Preprocessor):
+class ZeroShotClassificationPreprocessor(NLPPreprocessorBase):

     def __init__(self, model_dir: str, *args, **kwargs):
         """preprocess the data via the vocab.txt from the `model_dir` path
@@ -433,16 +278,11 @@ class ZeroShotClassificationPreprocessor(Preprocessor):

         Args:
             model_dir (str): model path
         """
-
-        super().__init__(*args, **kwargs)
-
-        from sofa import SbertTokenizer
-        self.model_dir: str = model_dir
         self.sequence_length = kwargs.pop('sequence_length', 512)
-        self.tokenizer = SbertTokenizer.from_pretrained(self.model_dir)
+        super().__init__(model_dir, *args, **kwargs)

     @type_assert(object, str)
-    def __call__(self, data: str, hypothesis_template: str,
+    def __call__(self, data, hypothesis_template: str,
                  candidate_labels: list) -> Dict[str, Any]:
         """process the raw input data

@@ -454,6 +294,9 @@ class ZeroShotClassificationPreprocessor(Preprocessor):
         Returns:
             Dict[str, Any]: the preprocessed data
         """
+        if isinstance(data, dict):
+            data = data.get(self.first_sequence)
+
         pairs = [[data, hypothesis_template.format(label)]
                  for label in candidate_labels]
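The three added lines make the zero-shot preprocessor dict-tolerant; the rest of its job is unchanged and easy to see in isolation. A standalone illustration of the hypothesis-template expansion performed just before tokenization (the sample text, template, and labels are invented):

```python
data = 'This phone has an amazing camera.'
hypothesis_template = 'This text is about {}.'
candidate_labels = ['technology', 'sports', 'politics']

# One premise/hypothesis pair per candidate label, exactly as in the
# `pairs = [...]` line kept as context above.
pairs = [[data, hypothesis_template.format(label)]
         for label in candidate_labels]
print(pairs[0])  # ['This phone has an amazing camera.', 'This text is about technology.']
```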
@@ -11,6 +11,9 @@ from modelscope.hub.file_download import model_file_download
 from modelscope.hub.snapshot_download import snapshot_download
 from modelscope.utils.config import Config
 from modelscope.utils.constant import ModelFile
+from .logger import get_logger
+
+logger = get_logger(__name__)


 def create_model_if_not_exist(
@@ -67,3 +70,18 @@ def auto_load(model: Union[str, List[str]]):
     ]

     return model
+
+
+def get_model_type(model_dir):
+    try:
+        configuration_file = osp.join(model_dir, ModelFile.CONFIGURATION)
+        config_file = osp.join(model_dir, 'config.json')
+        if osp.isfile(configuration_file):
+            cfg = Config.from_file(configuration_file)
+            return cfg.model.model_type if hasattr(cfg.model, 'model_type') and not hasattr(cfg.model, 'type') \
+                else cfg.model.type
+        elif osp.isfile(config_file):
+            cfg = Config.from_file(config_file)
+            return cfg.model_type if hasattr(cfg, 'model_type') else None
+    except Exception as e:
+        logger.error(f'parse config file failed with error: {e}')
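`get_model_type` understands two layouts: a ModelScope configuration file with a nested `model` section (preferring `model.model_type` only when `model.type` is absent, otherwise returning `model.type`), and a bare HuggingFace-style `config.json` with a top-level `model_type`. A minimal sketch exercising the second branch, assuming `ModelFile.CONFIGURATION` maps to `configuration.json`:

```python
import json
import os
import tempfile

from modelscope.utils.hub import get_model_type

# Fabricated model dir containing only a HuggingFace-style config.json.
model_dir = tempfile.mkdtemp()
with open(os.path.join(model_dir, 'config.json'), 'w') as f:
    json.dump({'model_type': 'bert'}, f)

print(get_model_type(model_dir))  # -> 'bert', via the config.json branch
```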
@@ -6,7 +6,7 @@ from modelscope.hub.snapshot_download import snapshot_download
 from modelscope.models import Model
 from modelscope.models.nlp import SbertForSentenceSimilarity
 from modelscope.pipelines import SentenceSimilarityPipeline, pipeline
-from modelscope.preprocessors import SequenceClassificationPreprocessor
+from modelscope.preprocessors import SentenceSimilarityPreprocessor
 from modelscope.utils.constant import Tasks
 from modelscope.utils.test_utils import test_level

@@ -19,7 +19,7 @@ class SentenceSimilarityTest(unittest.TestCase):
     @unittest.skipUnless(test_level() >= 2, 'skip test in current test level')
     def test_run(self):
         cache_path = snapshot_download(self.model_id)
-        tokenizer = SequenceClassificationPreprocessor(cache_path)
+        tokenizer = SentenceSimilarityPreprocessor(cache_path)
         model = SbertForSentenceSimilarity(cache_path, tokenizer=tokenizer)
         pipeline1 = SentenceSimilarityPipeline(model, preprocessor=tokenizer)
         pipeline2 = pipeline(
@@ -35,7 +35,7 @@ class SentenceSimilarityTest(unittest.TestCase):
     @unittest.skipUnless(test_level() >= 2, 'skip test in current test level')
     def test_run_with_model_from_modelhub(self):
         model = Model.from_pretrained(self.model_id)
-        tokenizer = SequenceClassificationPreprocessor(model.model_dir)
+        tokenizer = SentenceSimilarityPreprocessor(model.model_dir)
         pipeline_ins = pipeline(
             task=Tasks.sentence_similarity,
             model=model,
@@ -6,7 +6,7 @@ from modelscope.hub.snapshot_download import snapshot_download
 from modelscope.models import Model
 from modelscope.models.nlp import SbertForTokenClassification
 from modelscope.pipelines import WordSegmentationPipeline, pipeline
-from modelscope.preprocessors import TokenClassifcationPreprocessor
+from modelscope.preprocessors import TokenClassificationPreprocessor
 from modelscope.utils.constant import Tasks
 from modelscope.utils.test_utils import test_level

@@ -18,7 +18,7 @@ class WordSegmentationTest(unittest.TestCase):
     @unittest.skipUnless(test_level() >= 2, 'skip test in current test level')
     def test_run_by_direct_model_download(self):
         cache_path = snapshot_download(self.model_id)
-        tokenizer = TokenClassifcationPreprocessor(cache_path)
+        tokenizer = TokenClassificationPreprocessor(cache_path)
         model = SbertForTokenClassification(cache_path, tokenizer=tokenizer)
         pipeline1 = WordSegmentationPipeline(model, preprocessor=tokenizer)
         pipeline2 = pipeline(
@@ -31,7 +31,7 @@ class WordSegmentationTest(unittest.TestCase):
     @unittest.skipUnless(test_level() >= 0, 'skip test in current test level')
     def test_run_with_model_from_modelhub(self):
         model = Model.from_pretrained(self.model_id)
-        tokenizer = TokenClassifcationPreprocessor(model.model_dir)
+        tokenizer = TokenClassificationPreprocessor(model.model_dir)
         pipeline_ins = pipeline(
             task=Tasks.word_segmentation, model=model, preprocessor=tokenizer)
         print(pipeline_ins(input=self.sentence))