Initial set-up #12

Open
soliviantar opened this issue Oct 20, 2023 · 3 comments

Comments

@soliviantar

Hi. I am trying to build a dictionary from the eswiktionary dump, but I am new to coding, so I am probably doing a lot of things wrong.

I downloaded the dump and created the executable, but I get an error every time I run it. I think I'm not setting the Settings.toml file correctly or that maybe I should be putting it somewhere else.

This is the output of the executable (in PowerShell 7, as admin):

PS D:\IDM\dictionary-builder-master\target\release> .\dictionary-builder.exe
dictionnary-builder will use D:\IDM\dictionary-builder-master\target\release\dump\eswiktionary-latest-pages-articles-multistream.xml
thread 'main' panicked at src\main.rs:59:54:
Unable to create file: Os { code: 5, kind: PermissionDenied, message: "Acceso denegado." }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
PS D:\IDM\dictionary-builder-master\target\release> .\dictionary-builder.exe RUST_BACKTRACE=1
thread 'main' panicked at src\main.rs:25:79:
called `Result::unwrap()` on an `Err` value: configuration file "RUST_BACKTRACE=1Settings.toml" not found
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
PS D:\IDM\dictionary-builder-master\target\release>

This is my Settings.toml (which I put inside the \release\ folder now):

root="D:\\IDM\\dictionary-builder-master\\target\\release\\dico"
words_file="D:\\IDM\\dictionary-builder-master\\target\\release\\dico\\words"
excluded_words_file="D:\\IDM\\dictionary-builder-master\\target\\release\\dico\\excluded"
xml_dump="D:\\IDM\\dictionary-builder-master\\target\\release\\dump\\eswiktionary-latest-pages-articles-multistream.xml"
with_definition = true
expression = true
language_filter = true
language = "Spanish"
language_short = "es"

Any help would be appreciated.

@newca12
Owner

newca12 commented Oct 21, 2023

You are not doing anything wrong; the readme had missing and misleading instructions. The root dico folder must be created by you before running the program, and I've added that to the readme. I have also added a fix to handle the Spanish dump properly, so make sure you update your program to include it. Last but not least, I have added a warning section that clarifies what can be expected from dictionary-builder, to avoid disappointment. If all goes well, with the latest eswiktionary-latest-pages-articles-multistream.xml you should end up with:
[INFO dictionary_builder] total number of entries:819815
[INFO dictionary_builder] total number of removed entries:141492
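
For anyone who would rather not create the folder by hand, here is a minimal sketch of creating the root folder before any file is written into it. This is not the project's actual code: `prepare_root` is a hypothetical helper, and the demo path under the temp directory is only an illustration (substitute the `root` value from your own Settings.toml). It also assumes the missing folder is the only cause of the panic.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not part of dictionary-builder): make sure the
/// `root` folder named in Settings.toml exists before files are created in it.
/// `create_dir_all` also creates missing parent folders and returns Ok
/// if the folder is already there.
fn prepare_root(root: &Path) -> io::Result<()> {
    fs::create_dir_all(root)
}

fn main() -> io::Result<()> {
    // Illustrative path only; use the `root` value from your own Settings.toml.
    let root = std::env::temp_dir()
        .join("dictionary-builder-demo")
        .join("dico");
    prepare_root(&root)?;
    println!("root folder ready: {}", root.display());
    Ok(())
}
```

Creating the folder manually in Explorer or with `mkdir` before running the executable achieves the same thing.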

@soliviantar
Author

Oh, ok, thanks! I will try that then. I had created the dico folder somewhere, I believe. I'll check it again.

Also, having an example of some extracted data on the readme would be good.

By the way, what does "expression" in the settings mean?

@newca12
Owner

newca12 commented Oct 21, 2023

Basically, if expression is set to true, an entry in the dump (a potential word) that contains a space is treated as an expression, which can be very wrong for some languages. If it is set to false, all entries containing spaces are simply discarded.
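
To make that rule concrete, here is a rough sketch of the behaviour described above. It is not the code from the repository: the `Entry` enum and the `classify` function are hypothetical names, and the sketch assumes a plain space character is the only thing that marks an entry as an expression.

```rust
/// How a dump title ends up being handled.
/// These names are hypothetical and not taken from dictionary-builder.
#[derive(Debug, PartialEq)]
enum Entry {
    Word(String),
    Expression(String),
}

/// Classify a title according to the `expression` setting.
/// Returns `None` when the entry is discarded.
fn classify(title: &str, expression: bool) -> Option<Entry> {
    if !title.contains(' ') {
        // No space: always kept as a plain word.
        Some(Entry::Word(title.to_string()))
    } else if expression {
        // Space and expression = true: kept as an expression.
        Some(Entry::Expression(title.to_string()))
    } else {
        // Space and expression = false: dropped.
        None
    }
}

fn main() {
    for title in ["casa", "a la carta"] {
        println!("{title:?} with expression = true:  {:?}", classify(title, true));
        println!("{title:?} with expression = false: {:?}", classify(title, false));
    }
}
```

With `expression = false`, a multi-word entry such as "a la carta" would not appear in the output at all, while single words are unaffected by the setting.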
