The cleaning script removes some numbers incorrectly #8

Open
jiejie1993 opened this issue Jan 9, 2024 · 0 comments
Labels
question Further information is requested

Comments

@jiejie1993

`ORIGINAL| 2021-02-03 16:09:00 天润乳业公告,拟在新疆生产建设兵团第十二师222团投资建设10000头规模化奶牛示范牧场项目,222团予以提供牧场运营所需配套资源,222团保证二十年内免费提供本项目使用的设施农业用地,免征土地租赁费,并保障项目生产经营所需水、电等基础配套设施。 |

CLEANED| 02-03 16:09:00 天润乳业公告,拟在新疆生产建设兵团第十二师222团投资建设头规模化奶牛示范牧场项目,222团予以提供牧场运营所需配套资源,222团保证二十年内免费提供本项目使用的设施农业用地,免征土地租赁费,并保障项目生产经营所需水、电等基础配套设施。 |`
As shown above, the year "2021" and the number "10000" were deleted, while "222" and the time "16:09:00" were kept. Which config option causes this deletion?
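To pin down exactly which spans are being dropped, the two strings can be compared programmatically. The following is a minimal sketch using only Python's standard `difflib`; the helper name `deleted_spans` is hypothetical, and the sample strings are shortened versions of the ORIGINAL and CLEANED lines above.

```python
from difflib import SequenceMatcher


def deleted_spans(original: str, cleaned: str) -> list[str]:
    """Return substrings that appear in `original` but are missing from `cleaned`."""
    # autojunk=False keeps the matcher from treating frequent characters as junk.
    matcher = SequenceMatcher(None, original, cleaned, autojunk=False)
    return [
        original[i1:i2]
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag in ("delete", "replace")
    ]


# Shortened copies of the sample from this issue (not the full record).
original = "2021-02-03 16:09:00 天润乳业公告,拟在新疆生产建设兵团第十二师222团投资建设10000头规模化奶牛示范牧场项目"
cleaned = "02-03 16:09:00 天润乳业公告,拟在新疆生产建设兵团第十二师222团投资建设头规模化奶牛示范牧场项目"

print(deleted_spans(original, cleaned))
```

On this sample the helper reports `2021-` and `10000`, so the hyphen after the year is removed together with the digits, which may help narrow down which pattern matched.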

My config file is:

    basic:
      batch_size: 3000
      input: Astock_all_converted.jsonl
      is_jsonl: true
      num_workers: 32
      output: Astock_all.jsonl
      result_key: target
      source_key: target
    extractors:
      ContentExtractor:
        save_key: pageContent
      TimeExtractor:
        save_key: pagePublishTime
      TitleExtractor:
        save_key: pageTitle
    filters:
      SimplifiedFilter:
        config_file: t2s.json
      SymbolFilter:
        filter_control: true
        filter_emoji: true
      TextCleaner:
        filter_extraspace: true
        filter_personal: true
        filter_url: true
      TextIntegrityChecker:
        do_end_clip: true
        double_mark_check: true
        end_mark_check: true
        length_check: true
        min_length: 16
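One way to find the responsible option is to rerun the cleaner on the single failing record with one boolean flag switched off at a time. The sketch below only generates those config variants as plain Python dictionaries; the function name `single_toggle_variants` is hypothetical, the `filters` dictionary is copied by hand from the config above, and how each variant is written to disk and passed to the cleaning script is left out because it is not shown in this issue.

```python
import copy

# The `filters` section of the config above, transcribed by hand.
filters = {
    "SimplifiedFilter": {"config_file": "t2s.json"},
    "SymbolFilter": {"filter_control": True, "filter_emoji": True},
    "TextCleaner": {
        "filter_extraspace": True,
        "filter_personal": True,
        "filter_url": True,
    },
    "TextIntegrityChecker": {
        "do_end_clip": True,
        "double_mark_check": True,
        "end_mark_check": True,
        "length_check": True,
        "min_length": 16,
    },
}


def single_toggle_variants(filters: dict):
    """Yield (option_path, variant) pairs with exactly one True flag set to False."""
    for filter_name, options in filters.items():
        for option, value in options.items():
            if value is True:  # skip strings and numbers such as min_length
                variant = copy.deepcopy(filters)
                variant[filter_name][option] = False
                yield f"{filter_name}.{option}", variant


variants = list(single_toggle_variants(filters))
for path, _variant in variants:
    print("try disabling:", path)
```

If the digits survive in the output of one variant, the flag disabled in that variant is the likely cause.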

@wuchengwei0122 wuchengwei0122 added the question Further information is requested label Jan 11, 2024