How-to Guide for Developers

Build your own ops

  • Data-modori allows everybody to build their own ops.
  • Before implementing a new op, please refer to Operators to avoid unnecessary duplication.
  • Suppose we want to add a new Filter operator called "TextLengthFilter" that keeps only the samples whose text length falls within an expected range. We can follow these steps to build it.
  1. (Optional) Add a new key to StatsKeys in data_modori/utils/ to store the statistic computed by the new op.
class StatsKeys(object):
    ...              # other keys
    text_len = 'text_len'
  1. Create a new op file in the corresponding data_modori/ops/filter/ directory as follows.
    • Because it is a Filter op, the new op needs to inherit from the basic Filter class (imported from base_op in the code below) and be decorated with OPERATORS so that it registers itself automatically.
import sys

from jsonargparse.typing import PositiveInt

from data_modori.utils.constant import Fields, StatsKeys

from ..base_op import OPERATORS, Filter

@OPERATORS.register_module('text_length_filter')
class TextLengthFilter(Filter):
    """Filter to keep samples with total text length within a specific
    range."""

    def __init__(self,
                 min_len: PositiveInt = 10,
                 max_len: PositiveInt = sys.maxsize,
                 *args,
                 **kwargs):
        """
        Initialization method.

        :param min_len: The min text length in the filtering. samples
            will be filtered if their text length is below this
            parameter.
        :param max_len: The max text length in the filtering. samples
            will be filtered if their text length exceeds this
            parameter.
        :param args: extra args
        :param kwargs: extra args
        """
        super().__init__(*args, **kwargs)
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats(self, sample):
        # check if it's computed already
        if StatsKeys.text_len in sample[Fields.stats]:
            return sample

        sample[Fields.stats][StatsKeys.text_len] = len(sample[self.text_key])
        return sample

    def process(self, sample):
        # keep the sample only if its text length lies in [min_len, max_len]
        if self.min_len <= sample[Fields.stats][StatsKeys.text_len] <= self.max_len:
            return True
        else:
            return False
  1. After implementation, add it to the op dictionary in the file in the data_modori/ops/filter/ directory.
from . import (...,              # other ops
               text_length_filter)  # import this new op module
  1. Now you can use this new op with custom arguments in your own config files!
# other configs

# process configs
process:
  - text_length_filter:  # add this op to your process list and set the parameters
      min_len: 10
      max_len: 1000
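
How registration, compute_stats, and process fit together can be sketched without Data-modori installed. The Registry, Fields, StatsKeys, and Filter classes below are simplified stand-ins written for this illustration (the real ones live in data_modori and may differ), and run_filters is a hypothetical helper, not the project's executor.

```python
import sys


class Registry:
    """Stand-in registry: maps an op name to its class."""

    def __init__(self):
        self.modules = {}

    def register_module(self, name):
        def decorator(cls):
            self.modules[name] = cls
            return cls
        return decorator


OPERATORS = Registry()


class Fields:
    stats = 'stats'  # hypothetical key; the real constant may differ


class StatsKeys:
    text_len = 'text_len'


class Filter:
    """Stand-in base class that only stores the text key."""

    def __init__(self, text_key='text'):
        self.text_key = text_key


@OPERATORS.register_module('text_length_filter')
class TextLengthFilter(Filter):

    def __init__(self, min_len=10, max_len=sys.maxsize, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats(self, sample):
        # Compute the statistic once and cache it on the sample.
        stats = sample.setdefault(Fields.stats, {})
        if StatsKeys.text_len not in stats:
            stats[StatsKeys.text_len] = len(sample[self.text_key])
        return sample

    def process(self, sample):
        # Keep the sample only if its length lies in [min_len, max_len].
        text_len = sample[Fields.stats][StatsKeys.text_len]
        return self.min_len <= text_len <= self.max_len


def run_filters(samples, process_list):
    """Hypothetical helper: apply each configured op in order."""
    for op_cfg in process_list:
        for op_name, op_args in op_cfg.items():
            op = OPERATORS.modules[op_name](**op_args)
            samples = [s for s in map(op.compute_stats, samples)
                       if op.process(s)]
    return samples


# Mirrors the process list from the config above.
process_list = [{'text_length_filter': {'min_len': 10, 'max_len': 1000}}]
samples = [{'text': 'too short'}, {'text': 'long enough to be kept'}]
kept = run_filters(samples, process_list)
print([s['text'] for s in kept])
```

The first sample has 9 characters and is dropped; the second has 22 and is kept. Splitting the work into compute_stats and process means the statistic is cached on the sample and can be reused without recomputing it.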

Build your own configs

  • We provide easy configuration based on jsonargparse to reduce boilerplate code.

Rich config sources & type hints

  • A global config object can be initialized via
self.cfg = init_configs()
  • Its arguments can be specified in, and combined from, several sources:
  1. hard-coded default values, set when registering the config into the parser or specified in the classes' __init__ functions
  2. default config files in json (yaml or jsonnet supersets)
  3. environment variables
  4. POSIX-style command line arguments, such as --project_name my_data_demo or --project_name=my_data_demo, including config files
  • The final parsed values are merged from these sources. The override order follows the numbering above: a source with a higher number overrides the ones before it.
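
The override order can be pictured with a small standard-library sketch. This only illustrates the precedence rule and is not how jsonargparse resolves values internally; the config-file contents, the DEMO_ environment prefix, and the values themselves are made up for the example.

```python
from collections import ChainMap

# 1. hard-coded defaults
defaults = {'project_name': 'hello_world', 'np': 4, 'export_path': None}

# 2. values read from a default config file (hypothetical contents)
config_file = {'project_name': 'from_config', 'export_path': './out.jsonl'}

# 3. environment variables (a fake environment with a hypothetical prefix)
fake_environ = {'DEMO_NP': '8', 'HOME': '/home/user'}
env = {'np': int(fake_environ['DEMO_NP'])}

# 4. command-line arguments
cli = {'project_name': 'my_data_demo'}

# ChainMap searches its maps from left to right, so the source with the
# highest priority is listed first.
cfg = ChainMap(cli, env, config_file, defaults)

print(cfg['project_name'])  # taken from the command line
print(cfg['np'])            # taken from the environment
print(cfg['export_path'])   # taken from the config file
```

Each key is resolved independently, so a single run can take project_name from the command line, np from the environment, and export_path from the config file.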

In addition, many argument types and their validation are supported, including Python built-in types, types from the Lib/typing module, and extended types from jsonargparse, such as restricted types and Paths with customized limitations.
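
To give a feel for what a restricted type does, the sketch below rebuilds the idea with the standard library's argparse instead of jsonargparse. The positive_int and closed_unit_interval functions are hypothetical stand-ins for types such as PositiveInt and ClosedUnitInterval, the --ratio option is invented for the example, and exit_on_error requires Python 3.9 or later.

```python
import argparse


def positive_int(value):
    """Stand-in for a restricted type such as PositiveInt."""
    number = int(value)
    if number <= 0:
        raise argparse.ArgumentTypeError(
            f'{value!r} is not a positive integer')
    return number


def closed_unit_interval(value):
    """Stand-in for a restricted type such as ClosedUnitInterval."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError(f'{value!r} is not in [0, 1]')
    return number


# exit_on_error=False makes argparse raise instead of exiting the process.
parser = argparse.ArgumentParser(exit_on_error=False)
parser.add_argument('--np', type=positive_int, default=1)
parser.add_argument('--ratio', type=closed_unit_interval, default=0.25)

# Valid values are converted to the target type.
args = parser.parse_args(['--np', '4', '--ratio', '0.5'])
print(args.np, args.ratio)

# An out-of-range value is rejected at parse time.
try:
    parser.parse_args(['--np', '0'])
    rejected = False
except argparse.ArgumentError:
    rejected = True
print(rejected)
```

Validating at parse time means an invalid setting is reported before any data is processed.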

Hierarchical configs and helps

  • You can freely use dot notation in argument names to define the hierarchy, e.g., maximum_line_length_filter.min. More importantly, by default, we automatically register the configs from the docstrings of implemented operators. That is, the structure of all configs is always in sync with the code.

  • You can get the hierarchical help information by running a script that calls our executor, such as:

$ python tools/ --help

usage: [-h] [--config CONFIG] [--print_config[=flags]] [--project_name PROJECT_NAME] [--dataset_path DATASET_PATH] [--dataset_dir DATASET_DIR] [--export_path EXPORT_PATH] [--process PROCESS]
                            [--np NP] [--text_keys TEXT_KEYS] [--document_deduplicator CONFIG] [--document_deduplicator.hash_method HASH_METHOD] [--document_deduplicator.lowercase LOWERCASE]
                            [--document_deduplicator.ignore_non_character IGNORE_NON_CHARACTER] [--language_id_score_filter CONFIG] [--language_id_score_filter.lang LANG] [--words_num_filter CONFIG] [--words_num_filter.min MIN] [--words_num_filter.max MAX]
                            [--alphanumeric_filter CONFIG] [--alphanumeric_filter.min MIN] [--alphanumeric_filter.max MAX] [--average_line_length_filter CONFIG] [--average_line_length_filter.min MIN] [--average_line_length_filter.max MAX]
                            [--maximum_line_length_filter CONFIG] [--maximum_line_length_filter.min MIN] [--maximum_line_length_filter.max MAX] [--text_length_filter CONFIG] [--text_length_filter.min MIN] [--text_length_filter.max MAX]
                            [--remove_comments_mapper CONFIG] [--remove_comments_mapper.type TYPE] [--remove_comments_mapper.inline INLINE] [--remove_comments_mapper.multiline MULTILINE] [--remove_header_mapper CONFIG]
                            [--remove_header_mapper.before_section BEFORE_SECTION]

optional arguments:
  -h, --help            Show this help message and exit.
  --config CONFIG       Path to a configuration file.
  --print_config[=flags]
                        Print the configuration after applying all other arguments and exit. The optional flags customizes the output and are one or more keywords separated by comma. The supported flags are: comments, skip_default, skip_null.
  --project_name PROJECT_NAME
                        name of your data process project. (type: str, default: null)
  --dataset_path DATASET_PATH
                        path to your dataset file, relative with respect to the config file’s location (type: Path_fr, default: null)
  --dataset_dir DATASET_DIR
                        path to your dataset(s) within a directory, relative with respect to the config file’s location (type: Path_drw, default: null)
  --export_path EXPORT_PATH
                        path to the output processed dataset, relative with respect to the config file’s location (type: Path_fc, default: null)
  --process PROCESS, --process+ PROCESS
                        a list of several process operators with their arguments (type: List[Dict], default: null)
  --np NP               number of subprocess to process your dataset. (type: PositiveInt, default: null)

<class 'data_modori.ops.filter.alphanumeric_filter.AlphanumericFilter'>:
  --alphanumeric_filter CONFIG
                        Path to a configuration file.
  --alphanumeric_filter.min MIN
                        the min filter rate in alphanumeric op. (type: ClosedUnitInterval, default: 0.0)
  --alphanumeric_filter.max MAX
                        the max filter rate in alphanumeric op. (type: ClosedUnitInterval, default: 0.25)

<class 'data_modori.ops.filter.text_length_filter.TextLengthFilter'>:
  --text_length_filter CONFIG
                        Path to a configuration file.
  --text_length_filter.min MIN
                        min text length in the filtering (type: int, default: 10)
  --text_length_filter.max MAX
                        max text length in the filtering (type: int, default: 10000)
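
The dot notation and the automatic registration described in this section can be approximated with a short sketch. It uses only the standard library's inspect module; the project relies on jsonargparse for this, which does considerably more, such as building the help text. The TextLengthFilter class here is a stand-in without a base class, and dotted_options and nest are hypothetical helpers.

```python
import inspect
import sys


class TextLengthFilter:
    """Stand-in for the op defined earlier (no base class needed here)."""

    def __init__(self, min_len: int = 10, max_len: int = sys.maxsize):
        self.min_len = min_len
        self.max_len = max_len


def dotted_options(op_name, cls):
    """Map '<op_name>.<param>' to the default taken from __init__."""
    params = inspect.signature(cls.__init__).parameters
    return {
        f'{op_name}.{name}': param.default
        for name, param in params.items()
        if name != 'self' and param.default is not inspect.Parameter.empty
    }


def nest(flat):
    """Expand {'a.b': 1} into {'a': {'b': 1}}."""
    nested = {}
    for dotted_key, value in flat.items():
        node = nested
        *parents, leaf = dotted_key.split('.')
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


options = dotted_options('text_length_filter', TextLengthFilter)
# Simulate an override such as one given on the command line.
options['text_length_filter.max_len'] = 1000
print(nest(options))
```

Because the option names are derived from the constructor signature, adding or renaming a parameter in the op changes the available config keys without any separate bookkeeping.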
