
H0radricCube/WikiCorpusExtractor

WikiCorpusExtractor is a Python library for creating corpora from Wikipedia XML dump files. The target audience is people who need a collection of texts for language-processing tools.

The output of this library is a text file of the form:

<doc id="xx" title="Autism">
Text which is tokenized , i.e., words and punctuation are separated by a space .
Some special words like step-by-step or U.S.A. are correctly handled .
</doc>
<doc id="xxx" title="zzz">
...
</doc>
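
Reading the resulting corpus back into Python takes only a small amount of parsing. The sketch below is not part of the library; it only assumes the <doc ...> format shown above, and the helper name and file path are just illustrations.

import re

def iter_corpus_docs(path):
    # Yield (doc_id, title, text) tuples from a corpus file in the format above.
    with open(path, encoding='utf-8') as f:
        content = f.read()
    pattern = re.compile(r'<doc id="(.*?)" title="(.*?)">\n(.*?)\n</doc>', re.DOTALL)
    for doc_id, title, text in pattern.findall(content):
        yield doc_id, title, text

for doc_id, title, text in iter_corpus_docs('Resources/corpora/EN_Medicin_corpora.txt'):
    print(title, len(text.split()), 'tokens')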

Usage for building an English corpus (for other languages, use the corresponding Wikipedia's dump file)

DOWNLOAD XML DUMP FILE
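
One way to fetch a dump with the Python standard library is sketched below. The URL points at the full English Wikipedia articles dump on dumps.wikimedia.org and is only an illustration (the file is very large); the smaller EN_Medicine_depth2.xml.bz2 file used in the example further down looks like a topic-specific export rather than a full dump.

import os
import urllib.request

# Example only: the full English Wikipedia articles dump (tens of GB compressed).
# Any pages-articles.xml.bz2 file from dumps.wikimedia.org works the same way.
url = 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2'
os.makedirs('Resources/sources', exist_ok=True)
urllib.request.urlretrieve(url, 'Resources/sources/enwiki-latest-pages-articles.xml.bz2')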

CREATE A CORPUS FROM THE XML DUMP FILE (Python example)

from wikiXMLDump import WikiXMLDumpFile

if __name__ == "__main__":

    # Sources
    enSource = 'Resources/sources/EN_Medicine_depth2.xml.bz2'

    # Create object
    wk = WikiXMLDumpFile(enSource)
    # Show a document
    wkDoc = wk.getWikiDocumentByTitle('Abortion')
    print(wkDoc)
    # Print the Portuguese translation of the title (if available)
    print(wkDoc.getTranslatedTitle('pt'))
    # Clean the Wikipedia markup and tokenize the text
    wkDoc.cleanText()
    wkDoc.tokenizeText(forceLowerCase=True)  # True lowercases all words
    print(wkDoc)
    # Create a corpus of about 4M words and a minimum of about 500 words per document
    wk.createCorpus(filename='Resources/corpora/EN_Medicin_corpora.txt',
                    minWordsByDoc=500, maxWords=4000000, forceLowerCase=False)

Enjoy! :)
