- Data-modori allows everybody to build their own ops.
- Before implementing a new op, please refer to Operators to avoid unnecessary duplication.
- Assuming we want to add a new Filter operator called "TextLengthFilter" to get corpus of expected text length, we can follow these steps to build it.
- (Optional) Add a new StatsKeys in
data_modori/utils/constant.py
to store the statistical variable of the new op.
class StatsKeys(object):
... # other keys
text_len = 'text_len'
- Create a new op file
text_length_filter.py
in the correspondingdata_modori/ops/filter/
directory as follows.- Because it's a Filter op, so the new op needs to inherit from the basic
Filter
class in thebase_op.py
, and be decorated withOPERATORS
to register itself automatically.
- Because it's a Filter op, so the new op needs to inherit from the basic
import sys
from jsonargparse.typing import PositiveInt
from data_modori.utils.constant import Fields, StatsKeys
from ..base_op import OPERATORS, Filter
@OPERATORS.register_module('text_length_filter')
class TextLengthFilter(Filter):
"""Filter to keep samples with total text length within a specific
range."""
def __init__(self,
min_len: PositiveInt = 10,
max_len: PositiveInt = sys.maxsize,
*args,
**kwargs):
"""
Initialization method.
:param min_len: The min text length in the filtering. samples
will be filtered if their text length is below this
parameter.
:param max_len: The max text length in the filtering. samples
will be filtered if their text length exceeds this
parameter.
:param args: extra args
:param kwargs: extra args
"""
super().__init__(*args, **kwargs)
self.min_len = min_len
self.max_len = max_len
def compute_stats(self, sample):
# check if it's computed already
if StatsKeys.text_len in sample[Fields.stats]:
return sample
sample[Fields.stats][StatsKeys.text_len] = len(sample[self.text_key])
return sample
def process(self, sample):
if self.min_len <= sample[Fields.stats][StatsKeys.text_len] <= self.max_len:
return True
else:
return False
- After implemention, add it to the op dictionary in the
__init__.py
file indata_modori/ops/filter/
directory.
from . import (..., # other ops
text_length_filter) # import this new op module
- Now you can use this new op with custom arguments in your own config files!
# other configs
...
# process configs
process:
- text_length_filter: # add this op to your process list and set the parameters
min_len: 10
max_len: 1000
- We provide easy configuration based on jsonargparse to reduce cost for boilerplate codes.
- A global config object can be initialized via
# core.executor.py
self.cfg = init_configs()
- in which function arguments from diverse sources can be specified and mixed up, including
- hard-coded default values when registering the config into parser or specified in the classes'
__init__
functions - default config files in json (yaml or jsonnet supersets)
- environment variables
- POSIX-style command line arguments, such as
--project_name my_data_demo
or--project_name=my_data_demo
, including config files
- The final parsed values are mixed from these sources. And the override order is the same as the numbers above.
Besides, many argument types and respective validation are supported.
Including python built-in types, types from Lib/typing module, and
extended types
from jsonargparse, such as restricted types
and Paths
with customized
limitations.
-
You can use dot notation in the argument names freely to define the hierarchy, e.g.,
maximum_line_length_filter.min
. More importantly, by default, we automatically register the configs from the docstrings of implemented operators. That is, the structure of all configs are always in sync with codes. -
You can get the hierarchical help information by running a script that calls our executor such as
$ python tools/process_data.py --help
usage: process_data.py [-h] [--config CONFIG] [--print_config[=flags]] [--project_name PROJECT_NAME] [--dataset_path DATASET_PATH] [--dataset_dir DATASET_DIR] [--export_path EXPORT_PATH] [--process PROCESS]
[--np NP] [--text_keys TEXT_KEYS] [--document_deduplicator CONFIG] [--document_deduplicator.hash_method HASH_METHOD] [--document_deduplicator.lowercase LOWERCASE]
[--document_deduplicator.ignore_non_character IGNORE_NON_CHARACTER] [--language_id_score_filter CONFIG] [--language_id_score_filter.lang LANG] [--words_num_filter CONFIG] [--words_num_filter.min MIN] [--words_num_filter.max MAX]
[--alphanumeric_filter CONFIG] [--alphanumeric_filter.min MIN] [--alphanumeric_filter.max MAX] [--average_line_length_filter CONFIG] [--average_line_length_filter.min MIN] [--average_line_length_filter.max MAX]
[--maximum_line_length_filter CONFIG] [--maximum_line_length_filter.min MIN] [--maximum_line_length_filter.max MAX] [--text_length_filter CONFIG] [--text_length_filter.min MIN] [--text_length_filter.max MAX]
[--remove_comments_mapper CONFIG] [--remove_comments_mapper.type TYPE] [--remove_comments_mapper.inline INLINE] [--remove_comments_mapper.multiline MULTILINE] [--remove_header_mapper CONFIG]
[--remove_header_mapper.before_section BEFORE_SECTION]
optional arguments:
-h, --help Show this help message and exit.
--config CONFIG Path to a configuration file.
--print_config[=flags]
Print the configuration after applying all other arguments and exit. The optional flags customizes the output and are one or more keywords separated by comma. The supported flags are: comments, skip_default, skip_null.
--project_name PROJECT_NAME
name of your data process project. (type: str, default: null)
--dataset_path DATASET_PATH
path to your dataset file, relative with respect to the config file’s location (type: Path_fr, default: null)
--dataset_dir DATASET_DIR
path to your dataset(s) within a directory, relative with respect to the config file’s location (type: Path_drw, default: null)
--export_path EXPORT_PATH
path to the output processed dataset, relative with respect to the config file’s location (type: Path_fc, default: null)
--process PROCESS, --process+ PROCESS
a list of several process operators with their arguments (type: List[Dict], default: null)
--np NP number of subprocess to process your dataset. (type: PositiveInt, default: null)
<class 'data_modori.ops.filter.alphanumeric_filter.AlphanumericFilter'>:
--alphanumeric_filter CONFIG
Path to a configuration file.
--alphanumeric_filter.min MIN
the min filter rate in alphanumeric op. (type: ClosedUnitInterval, default: 0.0)
--alphanumeric_filter.max MAX
the max filter rate in alphanumeric op. (type: ClosedUnitInterval, default: 0.25)
<class 'data_modori.ops.filter.text_length_filter.TextLengthFilter'>:
--text_length_filter CONFIG
Path to a configuration file.
--text_length_filter.min MIN
min text length in the filtering (type: int, default: 10)
--text_length_filter.max MAX
max text length in the filtering (type: int, default: 10000)
......