Deep rewrite of Bitextor Snakefile for a vast performance improvement.
Some configuration parameters and intermediate generated files have also changed, so reusing configuration files or transient and permanent folders from older runs will cause problems.
Snakemake project structure now matches the standard.
Machine translation system training must now be performed manually.
Added a new crawler: linguacrawl, specialized in full TLD crawling.
Added a new method for deferred crawling that uses only MurmurHash hashes at the sentence alignment step.
A reconstructor is also provided: deferred-annotation-reconstructor.sh
Added sharding, which groups domains into 1 GB shards so that jobs are better balanced. It is done via giashard (Golang Internet Archive SHARDing).
A new WARC processor has been implemented in C++: warc2text
It is faster than the previous text extraction tool giawarc (now deprecated) and warc2preprocess.
Although it has the same features as giawarc, it still lacks features like PDF processing or boilerplate removal that are available in warc2preprocess.
Multiple improvements to bitextor-warc2htmlwarc.py and bitextor-warc2preprocess.py:
Added lxml as a text extraction parsing library option, and html5lib as an optional, additional parser.
html5lib is the cleanest supported parser but also the slowest
Removed alcazar, since all of its upstream code and references have vanished.
Fixed ‘simple’ text extraction parser for some table tags and new HTML5 tags.
ftfy is now disabled by default.
New translation-based document aligner written in C++ (document-aligner folder).
It is faster and has lower memory requirements than the previous Python code.
Moses tokenizers are now used by default through an efficient wrapper.
This will run by default if "wordTokenizers" is not defined in the Bitextor configuration.
This is the recommended option if your language is supported by Moses.
The original Moses sentence splitter script has been replaced with a faster port by Mediacloud.
This will run by default if "sentenceSplitters" is not defined in the Bitextor configuration.
This is the recommended option if your language is supported by the latest Moses release version of the sentence splitter script.
A Discord server is also up for live chat with other users and developers. There are also some bots there that keep you updated with news about Bitextor development and related projects.
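To illustrate the sharding item above, here is a minimal sketch of how domains could be grouped into size-bounded shards so that each job gets a similar amount of data. It is not giashard's actual algorithm, which these notes do not describe: the function name `group_into_shards`, the greedy first-fit-decreasing strategy, and the byte-size input are all assumptions made for illustration.

```python
from typing import Dict, List

# Assumed shard budget: 1 GB taken as 1024**3 bytes (the notes do not specify the exact unit).
ONE_GB = 1024 ** 3


def group_into_shards(domain_sizes: Dict[str, int], budget: int = ONE_GB) -> List[List[str]]:
    """Group domains into shards whose total size stays within `budget`.

    Hypothetical helper: a greedy first-fit-decreasing bin packing,
    NOT the algorithm used by giashard.
    """
    shards: List[List[str]] = []  # domain names per shard
    loads: List[int] = []         # current total size per shard

    # Place the largest domains first; break ties by name so the result is deterministic.
    ordered = sorted(domain_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for domain, size in ordered:
        for i, load in enumerate(loads):
            if load + size <= budget:
                shards[i].append(domain)
                loads[i] += size
                break
        else:
            # No existing shard has room (or the domain alone exceeds the budget):
            # open a new shard for it.
            shards.append([domain])
            loads.append(size)
    return shards


if __name__ == "__main__":
    # Made-up sizes in bytes, with a small budget to keep the example readable.
    sizes = {"a.example": 700, "b.example": 600, "c.example": 300, "d.example": 400}
    print(group_into_shards(sizes, budget=1000))
```

In this sketch a domain larger than the budget still gets a shard of its own, so no data is dropped. A real sharder may handle that case differently.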
Notes
The bitextor-v8.0.zip package includes the submodules' code and binaries. If you compile the project after cloning the repository, you first need to run git submodule update --init --recursive. This command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.0.zip package or cloning the repository at the v8.0 tag.
We will support the Bitextor 8.x branch until the next major version is released.
"We have unfinished business.", Beatrix
v8.0 Changelog
… slurm, nmt workflow for MarianNMT or pdf-extract (replaced by wrappers in WARC processors).
… edge tag in Dockerhub).
This discussion was created from the release Kill Bill-ingual: Vol. 8.