A few missing words: "used to", "believe in", "ease", "abdomen" #536
Comments
Some are redirect pages, like "feel up to", "believe in" and "take credit for". Maybe the kaikki website doesn't display them anymore, but all of the pages you listed should be included in the raw JSONL file.
I just searched the entire English JSON output for the text
If "used to" was in the output, wouldn't this text match? If you can find it, could you please provide me with the ID of one of the senses, or some other way to find it in the 1.5GB output JSON? Thanks! |
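For anyone repeating this check, a plain text search can miss entries depending on how the JSON is escaped or which key holds the title, so it may be safer to parse each line. Below is a minimal sketch, not taken from wiktextract itself. It assumes ordinary entries carry the headword under a `"word"` key and redirect records carry it under `"title"` with a `"redirect"` target; verify both against your own copy of the file. The helper name `find_entries` and the sample records (including the redirect target) are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def find_entries(lines: Iterable[str], target: str) -> Iterator[dict]:
    """Yield every JSONL record whose headword or redirect title equals target."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: normal entries use "word", redirect records use "title".
        if record.get("word") == target or record.get("title") == target:
            yield record


# Invented sample data standing in for a few lines of the real file.
sample = [
    '{"word": "believes in", "pos": "verb"}',
    '{"title": "believe in", "redirect": "believe"}',
    '{"word": "ease", "pos": "noun"}',
]

matches = list(find_entries(sample, "believe in"))
for record in matches:
    kind = "redirect" if "redirect" in record else "entry"
    print(kind, record)
```

Because the function only iterates over lines, an open file handle can be passed in place of `sample`, which streams the 1.5GB output without loading it into memory.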
I checked the raw JSONL file on kaikki.org and only the three redirect pages are in it, all the other pages are missing. But I run the |
I extracted the latest dump file locally, and the resulting JSONL file contains these pages. @kristian-clausal could you please update the files on kaikki.org? Maybe they will be included in the new files.
Tatu is currently handling everything kaikki.org and he's away at the moment. He's trying to get things stabilized again, and I'm not going to poke at it until it's ready. |
Looking good so far. Gonna check each of the words above as soon as my pipeline has finished processing. |
The following words are still missing:
believe in
These are Wiktionary redirect pages, as @xxyzz correctly stated. Does WiktExtract not generate entries for them?
This word is also missing at the moment, and it isn't a redirect (not sure whether it was already missing last month):
hieroglyph https://kaikki.org/dictionary/English/meaning/h/hi/hieroglyph.html
Redirect pages are in the raw JSON data file, and their data is defined here: wiktextract/src/wiktextract/wiktionary.py, lines 51 to 57 in e77dfb6
Page hieroglyph has a not closed |
Someone's already fixed it on Wiktionary, so hieroglyph should appear with the next data dump from them. Sometimes there's something actually wrong with the wiki source (unlike the usual, which is our fault), and the issue with open HTML tags is very difficult. How do you know when you're "supposed" to close something? Even Wiktionary doesn't have some kind of way of closing the tag. It doesn't show in the result (the div doesn't do anything much), but it messes up our parse tree. |
Thanks for the analysis, guys. Yes, I agree: if the Wiktionary data is malformed, there's not much you can or should do. Maybe you could output a warning, and maybe someone working on Wiktionary would gladly accept those warnings? But I agree, you should not try to "close HTML tags" yourself.
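To make the "warn, don't repair" idea concrete, here is a rough sketch of how unclosed tags could be reported without touching the source. It is not how wiktextract or its parser handles this; it uses Python's standard `html.parser` on raw text, and the names `UnclosedTagFinder` and `unclosed_tags` as well as the sample page text are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they are not tracked.
VOID_TAGS = {"br", "hr", "img", "wbr", "input", "meta", "link"}


class UnclosedTagFinder(HTMLParser):
    """Track opening tags and drop them again when a matching close appears."""

    def __init__(self) -> None:
        super().__init__()
        self.open_tags: list[tuple[str, tuple[int, int]]] = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            # Remember the tag name and the parser position where it was seen.
            self.open_tags.append((tag, self.getpos()))

    def handle_endtag(self, tag):
        # Close the nearest matching open tag; unmatched closes are ignored.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i]
                return


def unclosed_tags(text: str) -> list[tuple[str, tuple[int, int]]]:
    """Return (tag, position) pairs for tags that were opened but never closed."""
    finder = UnclosedTagFinder()
    finder.feed(text)
    finder.close()
    return finder.open_tags


# Invented page text with a div that is never closed.
page = '==English==\n<div class="x">\n===Noun===\nhieroglyph'
for tag, position in unclosed_tags(page):
    print(f"warning: <{tag}> opened at {position} is never closed")
```

A report like this only flags the problem for a human editor; deciding where the tag should have been closed is left to whoever fixes the page on Wiktionary.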
There should be a debug message that says the |
Hmm, that's the whole debug log, right? And searching for "hieroglyph" does not yield anything. Just so we don't misunderstand each other: I'm so grateful for this library and all the work you guys are putting into it. Up to now, every ticket of mine has been answered in less than 10 hours, which is by no means the usual case for open source projects. So kudos to you guys! But just in case you find time to do this, it might be helpful to create a separate log for those errors that are clearly technical problems in Wiktionary. Like |
Huh, English is the only language with the form "hieroglyph". No wonder there's no page for it, and it doesn't show up in the debug logs; I think the debug info might be completely discarded because there's no 'page' to attach it to. That's problematic... Oh, and this is a pet-project by someone who funds it. Honestly, we should be answering things quicker. :p |
Oh, Tatu funds it? Or someone else? Didn't know that :) |
These words were present in a previous version of the kaikki.org/dictionary outputs, but today they are missing, even though they are present in the English Wiktionary (some with their own entry, others not):
used to
believe in
ease
abdomen
shithole
guinea pig
take credit for
feel up to
pomelo
For the example "believe in", please note that "believes in" and "believed in" are present, but "believe in" isn't.
Would be nice to see at least some of them back in the output.