WikiAnc

A program for generating the WikiAnc dataset.

Usage

You can find the pregenerated dataset (from the March 1, 2023 dumps) on Huggingface.

If you want to regenerate the dataset with fresh Wikipedia/Wikidata dumps, you can build wikianc from source by running the following command:

cargo build --release

NOTE: The program uses language-specific filtering (i.e., matching the localized word for "file"), which only supports Croatian and English out of the box. Replace the relevant part of the parse_links function to properly support your language.
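
For illustration only, a minimal sketch of that kind of prefix check might look like the following. It is not the actual parse_links implementation, and the non-English prefixes shown are assumptions:

// Hypothetical sketch only -- not the actual parse_links code. It illustrates
// filtering out links whose target lives in the localized "File" namespace.
fn is_file_link(target: &str) -> bool {
    // "File:" is the English prefix; "Datoteka:" (Croatian) and "Fichier:"
    // (French) are assumed examples. Add your language's prefix here.
    const FILE_PREFIXES: &[&str] = &["File:", "Datoteka:", "Fichier:"];
    FILE_PREFIXES.iter().any(|prefix| target.starts_with(prefix))
}

fn main() {
    assert!(is_file_link("File:Example.png"));
    assert!(!is_file_link("Zagreb"));
    println!("prefix filter behaves as expected");
}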

wikianc uses the mappings between Wikipedia titles and Wikidata QIDs generated by wiki2qid. Follow its instructions to generate the Apache Avro file containing the mappings first.
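
If you want to sanity-check the mappings before running wikianc, a minimal sketch using the apache-avro crate could look like this (the file name is hypothetical, and the record layout is whatever wiki2qid produces):

use apache_avro::Reader;
use std::fs::File;

// Print the first few records of the mappings file to verify it is readable.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("wiki2qid.avro")?; // hypothetical file name
    let reader = Reader::new(file)?;
    for record in reader.take(5) {
        println!("{:?}", record?);
    }
    Ok(())
}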

It also uses a Wikipedia dump in the ndjson format, which can be generated by following the instructions here.

Once you have the necessary data, you can generate the dataset with the following command:

cargo run --release -- \
        --input-wiki "${WIKIPEDIA_NDJSON_FILE}" \
        --input-wiki2qid "${MAPPINGS_FILE}" \
        --output-dir "${OUTPUT_DIR}"

This will create 3 files named train.parquet, validation.parquet, and test.parquet in the directory specified by ${OUTPUT_DIR}.

The outputs are written into zstd-compressed Apache Parquet files. You can see the details of the schema on Huggingface.
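
As a quick sanity check, you can inspect the row count and schema of the generated files with any Parquet reader that supports zstd. A minimal sketch using the parquet crate (an assumption, not part of this repository) might look like this:

use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::schema::printer::print_schema;
use std::fs::File;

// Print the row count and schema of one of the generated output files.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("train.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let metadata = reader.metadata().file_metadata();
    println!("rows: {}", metadata.num_rows());
    print_schema(&mut std::io::stdout(), metadata.schema());
    Ok(())
}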

Performance

WikiAnc uses as many threads as there are logical CPU cores. On the English dump from March 2023, containing ~6,600,000 articles, it takes ~11 minutes to complete with peak memory usage of ~52 GB on an AMD Ryzen Threadripper 3970X CPU and an SSD.