WikiAnc

A program for generating the WikiAnc dataset.

Usage

You can find the pregenerated dataset (from the March 1, 2023 dumps) on Huggingface.

If you want to regenerate the dataset with fresh Wikipedia/Wikidata dumps, you can build wikianc from source by running the following command:

cargo build --release

NOTE: The program uses language-specific filtering (i.e., matching the localized word for "file"), which only supports Croatian and English out of the box. Replace the relevant part of the parse_links function to properly support your language.
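
For illustration only, a minimal sketch of that kind of prefix check might look like the following. It is not the actual parse_links implementation, and the non-English prefixes shown are assumptions:

// Hypothetical sketch only -- not the actual parse_links code. It illustrates
// filtering out links whose target lives in the localized "File" namespace.
fn is_file_link(target: &str) -> bool {
    // "File:" is the English prefix; "Datoteka:" (Croatian) and "Fichier:"
    // (French) are assumed examples. Add your language's prefix here.
    const FILE_PREFIXES: &[&str] = &["File:", "Datoteka:", "Fichier:"];
    FILE_PREFIXES.iter().any(|prefix| target.starts_with(prefix))
}

fn main() {
    assert!(is_file_link("File:Example.png"));
    assert!(!is_file_link("Zagreb"));
    println!("prefix filter behaves as expected");
}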

wikianc uses the mappings between Wikipedia titles and Wikidata QIDs generated by wiki2qid. Follow its instructions to generate the Apache Avro file containing the mappings first.
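
If you want to sanity-check the mappings before running wikianc, a minimal sketch using the apache-avro crate could look like this (the file name is hypothetical, and the record layout is whatever wiki2qid produces):

use apache_avro::Reader;
use std::fs::File;

// Print the first few records of the mappings file to verify it is readable.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("wiki2qid.avro")?; // hypothetical file name
    let reader = Reader::new(file)?;
    for record in reader.take(5) {
        println!("{:?}", record?);
    }
    Ok(())
}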

It also uses a Wikipedia dump in the ndjson format, which can be generated by following the instructions here.

Once you have the necessary data, you can generate the dataset with the following command:

cargo run --release -- \
        --input-wiki "${WIKIPEDIA_NDJSON_FILE}" \
        --input-wiki2qid "${MAPPINGS_FILE}" \
        --output-dir "${OUTPUT_DIR}"

This will create 3 files named train.parquet, validation.parquet, and test.parquet in the directory specified by ${OUTPUT_DIR}.

The outputs are written into zstd-compressed Apache Parquet files. You can see the details of the schema on Huggingface.
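
As a quick sanity check, you can inspect the row count and schema of the generated files with any Parquet reader that supports zstd. A minimal sketch using the parquet crate (an assumption, not part of this repository) might look like this:

use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::schema::printer::print_schema;
use std::fs::File;

// Print the row count and schema of one of the generated output files.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("train.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let metadata = reader.metadata().file_metadata();
    println!("rows: {}", metadata.num_rows());
    print_schema(&mut std::io::stdout(), metadata.schema());
    Ok(())
}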

Performance

WikiAnc uses as many threads as there are logical CPU cores. On the English dump from March 2023, containing ~6,600,000 articles, it takes ~11 minutes to complete with peak memory usage of ~52 GB on an AMD Ryzen Threadripper 3970X CPU and an SSD.