A few missing words: "used to", "believe in", "ease", "abdomen" #536

Open
yolpsoftware opened this issue Mar 12, 2024 · 14 comments

@yolpsoftware

These words were present in a previous version of the kaikki.org/dictionary outputs, but today they are missing, even though they are present in the English Wiktionary (some with their own entry, others not):

used to
believe in
ease
abdomen
shithole
guinea pig
take credit for
feel up to
pomelo

For the example "believe in", please note that "believes in" and "believed in" are present, but "believe in" isn't.

Would be nice to see at least some of them back in the output.

@xxyzz
Collaborator

xxyzz commented Mar 12, 2024

Some are redirect pages, like "feel up to", "believe in" and "take credit for". Maybe the kaikki website doesn't display them anymore, but all of the pages you listed should be included in the raw JSONL file.

@yolpsoftware
Author

I just searched the entire English JSON output for the text

"id": "en-used_to-en-

If "used to" was in the output, wouldn't this text match?

If you can find it, could you please provide me with the ID of one of the senses, or some other way to find it in the 1.5GB output JSON? Thanks!
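A minimal sketch of a more robust lookup than grepping for a sense ID prefix: stream the JSONL file one record at a time and compare the headword field. It assumes each regular record has a "word" key and that redirect records use "title" instead (as in the snippet quoted later in this thread). The function name `find_entries` and the sample records are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def find_entries(lines: Iterable[str], word: str) -> Iterator[dict]:
    """Yield every JSONL record whose headword equals `word`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        # Assumption: regular entries use "word", redirect records use "title".
        if entry.get("word") == word or entry.get("title") == word:
            yield entry


# Tiny in-memory stand-in for the real dump (hypothetical data).
sample = [
    '{"word": "believes in", "pos": "verb", "senses": [{"id": "x1"}]}',
    '{"word": "used to", "pos": "verb", "senses": [{"id": "x2"}]}',
]
matches = list(find_entries(sample, "used to"))
```

With the real file, the same function can be fed an open file handle, for example `with open(path, encoding="utf-8") as f: hits = list(find_entries(f, "used to"))`, so the 1.5GB file is never loaded into memory at once.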

@xxyzz
Collaborator

xxyzz commented Mar 12, 2024

I checked the raw JSONL file on kaikki.org, and only the three redirect pages are in it; all the other pages are missing. But when I ran the wiktwords command on these pages (--page option), the JSON file was created successfully. I'm not sure why they are missing from the raw JSON file.

@xxyzz
Collaborator

xxyzz commented Mar 13, 2024

I extracted the latest dump file locally, and the resulting JSONL file contains these pages. @kristian-clausal could you please update the files on kaikki.org? Maybe they will be included in the new files.

@kristian-clausal
Collaborator

Tatu is currently handling everything kaikki.org and he's away at the moment. He's trying to get things stabilized again, and I'm not going to poke at it until it's ready.

@yolpsoftware
Author

Looking good so far. Gonna check each of the words above as soon as my pipeline has finished processing.

@yolpsoftware
Author

The following words are still missing:

believe in
take credit for

These are Wiktionary redirect pages, as @xxyzz correctly stated. Does WiktExtract not generate entries for them?

And this word is missing currently, and it isn't a redirect. Not sure whether it was already missing last month:

hieroglyph

https://kaikki.org/dictionary/English/meaning/h/hi/hieroglyph.html
https://kaikki.org/dictionary/English/words/hierogamy--hierogrammatists.html
https://en.wiktionary.org/wiki/hieroglyph

@xxyzz
Collaborator

xxyzz commented Apr 11, 2024

Redirect pages are in the raw JSON data file; their data is defined here:

page_data = [
    {
        "title": title,
        "redirect": page.redirect_to,
        "pos": "hard-redirect",
    }
]
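On the consumer side, records of this shape could be gathered into a lookup table so that a phrase such as "believe in" resolves to the page it points at. This is only a sketch under the assumption that redirect records in the JSONL carry exactly the "title" and "redirect" keys shown above. The helper names `collect_redirects` and `resolve`, the hop limit, and the sample redirect target are hypothetical.

```python
import json
from typing import Dict, Iterable


def collect_redirects(lines: Iterable[str]) -> Dict[str, str]:
    """Map each redirect title to its target title."""
    redirects: Dict[str, str] = {}
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        # Assumption: redirect records have "title" and "redirect" keys.
        if "redirect" in entry and "title" in entry:
            redirects[entry["title"]] = entry["redirect"]
    return redirects


def resolve(word: str, redirects: Dict[str, str], max_hops: int = 5) -> str:
    """Follow redirects until a non-redirect title is reached (bounded to avoid loops)."""
    for _ in range(max_hops):
        if word not in redirects:
            break
        word = redirects[word]
    return word


# Hypothetical sample; the real redirect target may differ.
sample = [
    '{"title": "believe in", "redirect": "believe", "pos": "hard-redirect"}',
    '{"word": "believe", "pos": "verb", "senses": []}',
]
redirects = collect_redirects(sample)
target = resolve("believe in", redirects)
```

A pipeline that expects every headword to have senses could then look up `target` instead of the redirect title.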

The hieroglyph page has an unclosed <div> tag, which moves all the following nodes inside that div.

@kristian-clausal
Collaborator

Someone's already fixed it on Wiktionary, so hieroglyph should appear with the next data dump from them. Sometimes there's something actually wrong with the wiki source (unlike the usual, which is our fault), and the issue with open HTML tags is very difficult. How do you know when you're "supposed" to close something?

[Image: Screenshot at 2024-04-11 07-42-42]

Even Wiktionary doesn't have some kind of way of closing the tag. It doesn't show in the result (the div doesn't do anything much), but it messes up our parse tree.

@yolpsoftware
Author

Thanks for the analysis guys.

Yes, I agree: if the Wiktionary data is malformed, there's not much you can or should do. Maybe you could output a warning, and maybe someone working on Wiktionary will gladly accept those warnings? But I agree, you should not try to "close HTML tags" yourself.

@xxyzz
Collaborator

xxyzz commented Apr 11, 2024

There should be a debug message saying the <div> is not closed somewhere here: https://kaikki.org/dictionary/errors/debug.html

@yolpsoftware
Author

Hmm, that's the whole debug log, right? And, searching for "hieroglyph" does not yield anything.

Just so we don't misunderstand each other: I'm so grateful for this library and all the work you guys are putting into it. Up to now, every ticket of mine has been answered in less than 10 hours, which is by no means the usual case for open source projects. So kudos to you guys!

But just in case you find time to do this, it might be helpful to create a separate log for those errors that are clearly technical problems in Wiktionary. Like <div>s without a closing </div>. And I suppose there are Wiktionary people pretty interested in such a log, if the log is easily readable.
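As a rough illustration of what such a log could be built from, the sketch below counts <div> tags left open in a page's wikitext and emits one line per affected page. It is not how Wiktextract parses pages; it is a standalone regex check that only detects the imbalance and does not try to decide where a tag should be closed. It also ignores comments and nowiki sections. The names `unclosed_div_count` and `report_unclosed` and the sample pages are hypothetical.

```python
import re
from typing import Dict, List

# Matches <div ...>, </div> and self-closing <div ... />.
DIV_RE = re.compile(r"<(/?)div\b[^>]*?(/?)>", re.IGNORECASE)


def unclosed_div_count(wikitext: str) -> int:
    """Return how many <div> tags are still open at the end of the text."""
    depth = 0
    for match in DIV_RE.finditer(wikitext):
        closing, self_closing = match.group(1), match.group(2)
        if self_closing:
            continue  # <div /> opens nothing
        if closing:
            depth = max(depth - 1, 0)  # ignore stray </div>
        else:
            depth += 1
    return depth


def report_unclosed(pages: Dict[str, str]) -> List[str]:
    """Build one human-readable log line per page with unclosed <div> tags."""
    lines = []
    for title, text in pages.items():
        count = unclosed_div_count(text)
        if count:
            lines.append(f"{title}: {count} unclosed <div> tag(s)")
    return lines


# Hypothetical page sources for demonstration.
pages = {
    "hieroglyph": '<div class="box">Etymology text',
    "abdomen": "<div>balanced</div>",
}
log_lines = report_unclosed(pages)
```

Because the check is keyed by page title rather than by extracted entry, a page that yields no entries at all would still show up in the report.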

@kristian-clausal
Collaborator

Huh, English is the only language with the form "hieroglyph". No wonder there's no page for it, and it doesn't show up in the debug logs; I think the debug info might be completely discarded because there's no 'page' to attach it to. That's problematic...

Oh, and this is a pet-project by someone who funds it. Honestly, we should be answering things quicker. :p

@yolpsoftware
Author

Oh, Tatu funds it? Or someone else? Didn't know that :)
