Wikimedia To Text Corpus


Wikimedia is the driving force behind Wikipedia. It provides a monthly full backup of all the data on Wikipedia as well as its other properties. The purpose of this repo is to convert the Wikimedia dump from its native format into the text corpus format we use, i.e. (a short example follows the list):

  • The full corpus consists of one or more TXT files in a single folder
  • Each TXT file contains one or more articles
  • Each article has a header in the form "--- {id} ---"
  • Each article has its abstract and body extracted
  • One sentence per line
  • Paragraphs are separated by a blank line
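
For illustration, a converted TXT file might look like the snippet below. The id values and sentences are made up; they only show the layout.

--- 12 ---
This is the first sentence of the article's abstract.
This is the second sentence of the abstract.

This is the first sentence of the body.
This is the second sentence of the same paragraph.

A new paragraph starts after a blank line.

--- 13 ---
The next article follows in the same file.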

Operation

Install

You can install the package using the following steps:

Run pip install from an admin prompt:

pip uninstall wikimedia
python -OO -m pip install -v git+https://github.com/TextCorpusLabs/wikimedia.git

or, if you have the code locally:

pip uninstall wikimedia
python -OO -m pip install -v c:/repos/TextCorpusLabs/wikimedia

Run

You are responsible for getting the source files. They can be found on Wikimedia's dump site (https://dumps.wikimedia.org). You will need to navigate further into the particular wiki you want to download.

You are responsible for un-compressing and validating the source files. I recommend using 7zip. I installed my copy using Chocolatey.
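
As a hypothetical example, extracting a compressed dump with the 7-Zip command line might look like the line below. The archive name is a placeholder; real dump files are named by wiki and date.

7z x enwiki-latest-pages-articles.xml.bz2 -od:/data/wiki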

The reason this is your responsibility is that the dump is a single MASSIVE file. Sometimes Wikimedia will be busy and the download will be slow. Modern browsers support resuming a download for exactly this case. As of 2023/01/22 it is over 90 GB in .xml form. Make sure you have enough disk space before you start.

All the below commands assume the corpus is an extracted .xml file.

  1. Extract the metadata from the corpus.
wikimedia metadata -source d:/data/wiki/enwiki.xml -dest d:/data/wiki/enwiki.meta.csv

The following are required parameters:

  • source is the .xml file sourced from Wikimedia.
  • dest is the CSV file used to store the metadata.

The following are optional parameters:

  • log is the folder where raw XML chunks that failed to process are saved. It defaults to empty (not saved).
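
For example, to also capture the raw XML chunks that fail to process, the log option can be added. This is a sketch: the -log flag spelling assumes the same single-dash convention used by -source and -dest, and the output folder path is a placeholder.

wikimedia metadata -source d:/data/wiki/enwiki.xml -dest d:/data/wiki/enwiki.meta.csv -log d:/data/wiki/enwiki.log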

  2. Convert the data to our standard format.
wikimedia convert -source d:/data/wiki/enwiki.xml -dest d:/data/wiki.std

The following are required parameters:

  • source is the .xml file sourced from Wikimedia.
  • dest is the folder for the converted TXT files.

The following are optional parameters:

  • lines is the number of lines per TXT file. The default is 1000000.
  • dest_pattern is the format of the TXT file name. It defaults to wikimedia.{id:04}.txt. id is a counter that increments each time a file reaches the lines limit.
  • log is the folder where raw XML chunks that failed to process are saved. It defaults to empty (not saved).
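
As a sketch, a conversion that overrides the defaults might look like the line below. The -lines, -dest_pattern, and -log flag spellings are assumed to follow the same single-dash convention as -source and -dest, and the paths are placeholders.

wikimedia convert -source d:/data/wiki/enwiki.xml -dest d:/data/wiki.std -lines 500000 -dest_pattern "enwiki.{id:04}.txt" -log d:/data/wiki/enwiki.log

With these settings each TXT file holds roughly 500,000 lines, and the {id:04} pattern zero-pads the file counter to four digits, e.g. enwiki.0001.txt.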

Debug/Test

The code in this repo is set up as a module. Debugging and testing are based on the assumption that the module is already installed. In order to debug (F5) or run the tests (Ctrl + ; Ctrl + A), make sure to install the module as editable (see below).

pip uninstall wikimedia
python -m pip install -e c:/repos/TextCorpusLabs/wikimedia

Academic boilerplate

Below is the suggested text to add to the "Methods and Materials" section of your paper when using this process. The references can be found here.

The 2022/10/01 English version of Wikipedia [@wikipedia2020] was downloaded using Wikimedia's download service [@wikimedia2020]. The single-file data dump was then converted to a corpus of plain text articles using the process described in [@wikicorpus2020].