Simple English Wiktionary #144
base: master
Conversation
Here's the log from running it. The warning "linkage recurse URL" is especially common, although there are also a few unimplemented templates.
* pages do not contain separate headers for different languages
* every page contains definitions for English words
This fixes ~1,000 warnings caused by the use of the sense tags "transitive & intransitive" and "countable & uncountable".
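A fix like this can be sketched as a small tag-splitting pass; the function name and the " & " separator handling below are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical sketch: expand combined sense tags such as
# "transitive & intransitive" into separate tags.
def split_combined_tags(tags: list[str]) -> list[str]:
    result = []
    for tag in tags:
        # Splitting on " & " leaves single tags (no separator) intact.
        result.extend(part.strip() for part in tag.split(" & "))
    return result
```

For example, this turns `["countable & uncountable"]` into `["countable", "uncountable"]` while passing plain tags through unchanged.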
Add a new field "lists" giving the names of vocab lists that a word belongs to; currently we have the British National Corpus top 1,000, Charles Kay Ogden's Basic English 850 word list, and an academic word list.
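Populating such a "lists" field amounts to a membership lookup per word. The list names mentioned above are real, but the identifiers, data structure, and function below are assumptions for illustration (real lists would be loaded from data files):

```python
# Illustrative sketch: map vocab-list names to member words (hardcoded
# samples only; the actual lists contain hundreds of words each).
VOCAB_LISTS = {
    "bnc1000": {"freeze", "house", "water"},  # British National Corpus top 1,000
    "ogden850": {"house", "water"},           # Ogden's Basic English 850
    "academic": {"hypothesis"},               # academic word list
}

def lists_for_word(word: str) -> list[str]:
    # Return the names of all vocab lists this word belongs to.
    return sorted(name for name, words in VOCAB_LISTS.items() if word in words)
```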
Some templates are redirects (wik -> wikipedia, etc.); a proper solution would detect these redirects, but for now we just handle them manually to reduce the warning count.
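The manual handling described above amounts to a small alias table. Only the wik → wikipedia pair comes from the comment; the resolution loop is a hypothetical sketch:

```python
# Manually maintained template redirects ("wik" -> "wikipedia" is from the
# PR description; a proper solution would detect redirects automatically).
TEMPLATE_REDIRECTS = {
    "wik": "wikipedia",
}

def resolve_template(name: str) -> str:
    # Follow redirects until a canonical template name is reached,
    # guarding against accidental cycles in the hand-written table.
    seen = set()
    while name in TEMPLATE_REDIRECTS and name not in seen:
        seen.add(name)
        name = TEMPLATE_REDIRECTS[name]
    return name
```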
The previously added logic for parsing Simple EN pages was incomplete. Because Simple EN pages only contain data for one language, we cannot read the first header and then treat only the remainder of the page as the entry; we must use the whole page as the entry instead. Redoing this logic stops us from skipping many sections and also fixes an issue where pages with only one section were rendered completely empty. One example:

TITLE: almonds
==Noun==
{{noun|almond}}
#{{plural of|almond}}

Making this change also loads many more templates, which uncovered further issues that needed to be fixed in wikitextprocessor, so we have to update the branch being installed via pip. The output file went from 20K entries (10MB) to 37K entries (as advertised by Wiktionary's stats page) and 23MB.
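The structural difference can be sketched as follows; the helper name, regex, and return shape are assumptions for illustration, not wiktextract's real parsing code:

```python
import re

def page_entries(page_text: str, edition: str) -> list[tuple[str, str]]:
    """Return (language, section_text) pairs for a wiki page (simplified sketch)."""
    if edition == "simple":
        # Simple EN pages have no language headers: the whole page is
        # one English entry.
        return [("English", page_text)]
    # English edition: split on level-2 language headers like ==English==.
    parts = re.split(r"^==([^=]+)==\s*$", page_text, flags=re.MULTILINE)
    return [
        (parts[i].strip(), parts[i + 1])
        for i in range(1, len(parts) - 1, 2)
    ]
```

Under this sketch, the "almonds" page above becomes a single English entry on the Simple edition, instead of being discarded for lacking a language header.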
The templates used on the Simple EN wiki start with "The ".
Force-pushed from b22f342 to 0d4046b
Create a new parameter "edition" for specifying the language code of the edition of Wiktionary being input. For now, only allow "en" and "simple" editions. Move all Simple Wiktionary-specific logic behind a check for an edition value of "simple".
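The edition gate described above might look like the following; the constant and function names are assumptions (the PR author mentions wanting an enum or static constants eventually):

```python
# Hypothetical sketch of validating the new "edition" parameter.
SUPPORTED_EDITIONS = ("en", "simple")

def check_edition(edition: str) -> str:
    if edition not in SUPPORTED_EDITIONS:
        raise ValueError(f"unsupported Wiktionary edition: {edition!r}")
    return edition

def is_simple_edition(edition: str) -> bool:
    # All Simple Wiktionary-specific logic is gated on a check like this.
    return check_edition(edition) == "simple"
```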
Copy the tests from `test_page.py`, but with the language header removed. Additionally, copy the real page data from the entry for "freeze" just to verify current output. There are several outstanding TODOs.
I've added simple functionality for specifying the Wiktionary edition, currently only allowing "en" and "simple". This is extensible for use with further editions, so please let me know if this approach is acceptable. I would probably prefer an enum or at least static constants for the edition names, but I'm not sure where to put them yet. I've also added a simple test! It's unclear to me what level of pre-processing/template-expanding is expected for the input.
I merged major changes from @xxyzz yesterday that relate to support for other Wiktionary editions. He has implemented support for configuring various tables, including namespaces but also a lot else, in configuration files (on both the Wiktextract and Wikitextprocessor sides). Would there be any chance to rewrite these changes using the same approach?
There is now at least some support for parsing the Chinese Wiktionary (I've not had a chance to fully test it yet myself though). I would think supporting the Simple English Wiktionary should be much easier. I also have some personal interest in the Simple English Wiktionary for my other research.
That's great news! Let me investigate the changes from xxyzz and rebase. Indeed, the Simple English Wiktionary is quite similar to the English one; the major difference is that each page only contains one language, so the header structure is different (e.g. no language header). BTW, I also investigated the parsed files provided by xxyzz. I did see some unexpected structures and issues with translations, forms, examples, etc., but the glosses, the most important data for us, seemed to be good enough to use. I'm not sure if it'll be useful for you, but here's the script I used for checking it. It contains a basic model of the data output by Wiktextract and uses pydantic to validate it. (I have the script because I've taken to formatting other dictionaries the same way.)

# Models and validation code for Wiktextract output structure
# For documentation on what the fields mean, see:
# https://github.com/tatuylonen/wiktextract
# CLI USAGE: python3 models.py <file.jsonl> [<file2.jsonl> ...]
import json
import sys
from typing import List, Literal, Mapping, Optional
from pydantic import BaseModel, Extra, Field, root_validator, ValidationError
class WordLink(BaseModel, extra=Extra.forbid):
    alt: Optional[str]
    english: Optional[str]
    roman: Optional[str]
    sense: Optional[str]
    tags: Optional[List[str]]
    taxonomic: Optional[str]
    topics: Optional[List[str]]
    word: str
    # not specified in docs, but is still output by Wiktextract
    extra: Optional[str]

class Example(BaseModel, extra=Extra.forbid):
    text: str
    ref: Optional[str]
    english: Optional[str]
    type: Optional[str]
    roman: Optional[str]
    note: Optional[str]

class Translation(BaseModel, extra=Extra.forbid):
    alt: Optional[str]
    code: str
    english: str
    lang: str
    note: Optional[str]
    roman: Optional[str]
    sense: Optional[str]
    tags: Optional[List[str]]
    taxonomic: Optional[str]
    word: Optional[str]

    @root_validator
    def check_word(cls, values):
        if "word" not in values:
            assert "note" in values, "word can only be missing if a note is present"
        return values

class WithWordLinks(BaseModel, extra=Extra.forbid):
    alt_of: Optional[List[WordLink]]
    form_of: Optional[List[WordLink]]
    synonyms: Optional[List[WordLink]]
    antonyms: Optional[List[WordLink]]
    hypernyms: Optional[List[WordLink]]
    holonyms: Optional[List[WordLink]]
    meronyms: Optional[List[WordLink]]
    hyponyms: Optional[List[WordLink]]
    coordinate_terms: Optional[List[WordLink]]
    derived: Optional[List[WordLink]]
    related: Optional[List[WordLink]]
    # not used by Wiktextract, but appear on other edition Wiktionaries
    cooccurs_with: Optional[List[WordLink]]
    similar: Optional[List[WordLink]]

class Sense(WithWordLinks):
    glosses: Optional[List[str]]
    raw_glosses: Optional[List[str]]
    tags: Optional[List[str]]
    categories: Optional[List[str]]
    topics: Optional[List[str]]
    translations: Optional[List[Translation]]
    senseid: Optional[str]
    wikidata: Optional[List[str]]
    wikipedia: Optional[List[str]]
    examples: Optional[List[Example]]
    english: Optional[str]

    @root_validator
    def validate_glosses(cls, values):
        if "raw_glosses" in values:
            pass
            # output from Wiktextract doesn't conform to this
            # assert len(values["raw_glosses"]) == len(values["glosses"]), "raw_glosses and glosses must be the same length"
        else:
            assert (
                "no-gloss" in values["tags"]
            ), "no-gloss tag must be present if no glosses are present"
        return values

class Pronunciation(BaseModel, extra=Extra.forbid):
    ipa: Optional[str]
    enpr: Optional[str]
    audio: Optional[str]
    ogg_url: Optional[str]
    mp3_url: Optional[str]
    audio_ipa: Optional[str] = Field(alias="audio-ipa")
    homophone: Optional[str]
    hyphenation: Optional[List[str]]
    tags: Optional[List[str]]
    text: Optional[str]
    # these are not specified in docs, but are still output by Wiktextract
    other: Optional[str]
    note: Optional[str]
    topics: Optional[List[str]]

    @root_validator
    def any_field(cls, values):
        if not any(values.values()):
            raise ValueError("At least one field must be set")
        return values

class Form(BaseModel, extra=Extra.forbid):
    form: str
    tags: Optional[List[str]]

class Template(BaseModel, extra=Extra.forbid):
    name: str
    args: Mapping[str, str]
    expansion: str

POS = Literal[
    "abbrev",
    "adj_noun",
    "adj_verb",
    "adj",
    "adv_phrase",
    "adv",
    "affix",
    "ambiposition",
    "article",
    "character",
    "circumfix",
    "circumpos",
    "classifier",
    "clause",
    "combining_form",
    "conj",
    "converb",
    "counter",
    "det",
    "infix",
    "interfix",
    "intj",
    "name",
    "noun",
    "num",
    "particle",
    "phrase",
    "postp",
    "prefix",
    "prep",
    "preverb",
    "pron",
    "proverb",
    "punct",
    "romanization",
    "root",
    "suffix",
    "syllable",
    "symbol",
    "verb",
    # not used by Wiktextract, but appear on other edition Wiktionaries
    "prep_phrase",
    "noun_phrase",
    "adj_phrase",
    "verb_phrase",
]

class Entry(WithWordLinks):
    word: str
    pos: POS
    # code for the language the word belongs to
    lang_code: str
    # Name of the language corresponding to lang_code
    # (as it appears on Wiktionary, e.g. may or may not be an English word)
    lang: str
    senses: List[Sense]
    forms: Optional[List[Form]]
    sounds: Optional[List[Pronunciation]]
    categories: Optional[List[str]]
    topics: Optional[List[str]]
    translations: Optional[List[Translation]]
    etymology_text: Optional[str]
    etymology_templates: Optional[List[Template]]
    wikidata: Optional[List[str]]
    wiktionary: Optional[str]
    head_templates: Optional[List[Template]]
    inflection_templates: Optional[List[Template]]
    # not specified in docs, but is still output by Wiktextract
    wikipedia: Optional[List[str]]

def validate(data):
    if isinstance(data, str):
        data = json.loads(data)
    # Wiktextract output contains JSON lines for templates and other non-entries
    if "word" not in data and "title" in data:
        title = data["title"]
        print(f"Skipping non-word page: {title}", file=sys.stderr)
        return
    return Entry.parse_obj(data)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print(
            "USAGE: python3 models.py <file.jsonl> [file2.jsonl ...]", file=sys.stderr
        )
        sys.exit(1)
    for filename in sys.argv[1:]:
        entry_count = 0
        with open(filename) as f:
            line_num = 0
            for line in f:
                line_num += 1
                try:
                    entry = validate(line)
                    if entry:
                        entry_count += 1
                except ValidationError as e:
                    print(f"Error on line {line_num}: {line}\n{e}")
                    exit()
                if line_num % 1000 == 0:
                    print(".", end="", flush=True)
        print(f"\n✨ Validation of {filename} completed successfully! ✨")
        print(f"📊 Total entries: {entry_count} 📊")
This is a POC and cannot be merged as-is: it depends on a PR branch of wikitextprocessor, uses a hardcoded boolean to indicate that we are processing Simple EN text, has no tests, prints lots of warnings while running, and probably more. However, Simple English resources are really important for ESL users, so I'm publishing this so others know it's possible.