New format for wikipedia ranking files #2767
Nominatim currently uses the following columns of wikipedia_article: language, title, importance, wd_page_title. All other columns could be removed. This would cut down the table size by about 30%. We should also think about uniting the two separate tables. If we cut down wikipedia_article to only 4 columns, the simplest way to a single table would be to add full columns for all redirections. Put together, my suggestion for the final table would be:
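A minimal sketch of what such a single table could look like, assuming the four columns named above with redirect rows folded in (the table and column names are illustrative, not taken from this issue):

```sql
-- Hypothetical combined table: one row per article and one row per
-- redirect; a redirect row repeats the importance and wikidata id of
-- its target article, so every lookup stays a single-table query.
CREATE TABLE wikimedia_importance (
    language    TEXT NOT NULL,   -- Wikipedia language edition, e.g. 'en'
    title       TEXT NOT NULL,   -- article title or redirect title
    importance  DOUBLE PRECISION NOT NULL,
    wikidata_id TEXT,            -- e.g. 'Q42', kept as a string
    PRIMARY KEY (language, title)
);

CREATE INDEX wikimedia_importance_wikidata_idx
    ON wikimedia_importance (wikidata_id);
```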
Such a table would make the search for wikipedia matches significantly easier. Also, once we have only one table, moving to CSV directly is possible. With multiple tables we would need multiple files and I'm strongly against that. I'm skeptical about converting wikidata IDs into an int. They are strings with the Q prefix in OSM, so the query would have to convert the value first before looking up the data. That is a bit of a pain in SQL. As it is now, we can do a simple string comparison. The conversion saves about 10% space for both table and index.
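To illustrate the string-comparison point, here is a lookup sketch against the hypothetical table above; the value of the OSM wikidata tag can be used verbatim:

```sql
-- Plain string comparison, no prefix handling needed:
SELECT language, title, importance
  FROM wikimedia_importance
 WHERE wikidata_id = 'Q42';

-- With a numeric column the query would first have to strip the
-- prefix and cast, e.g.:
--   WHERE wikidata_id = substring('Q42' FROM 2)::bigint
```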
There is now a preliminary version of the importer for the new format at https://github.com/lonvia/Nominatim/tree/new-wikimedia-tables if you want to try the files out. The table is about half the size, indexes are slightly smaller. The lookup code has become a lot easier (although that's in large part because I have dumped the ability to parse wikipedia URLs; that is luckily a thing of the past).
Two small things I noticed with the new files:

1. They should get a header line, so that we can change the format in the future.
2. We should define the expected delimiters and quote chars (see the sketch below).
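As an illustration of point 2, a psql import sketch that spells the contract out explicitly, assuming comma delimiters, double-quote quoting and the hypothetical column list from above:

```sql
-- FORMAT csv already defaults to ',' and '"', but naming the options
-- documents the expected file format; HEADER skips the header line.
\copy wikimedia_importance (language, title, importance, wikidata_id)
  FROM 'wikipedia_article.csv'
  WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '"');
```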
The CSV file will now contain a header row (osm-search/wikipedia-wikidata@f798ce5). May 2024 files will be released sometime this week.
The current output is fine. I was more thinking that we need to properly document it, both in https://github.com/osm-search/wikipedia-wikidata and when describing the format in the Nominatim docs in the future. |
https://nominatim.org/release-docs/latest/admin/Import/#wikipediawikidata-rankings
The https://github.com/osm-search/wikipedia-wikidata project can be run regularly again. It takes about 18h and 350 GB of disc space (of which 300 GB is the PostgreSQL database). There is potential to make it faster, consume less disc space and add more languages over time.
The current output is three files:

- wikipedia_importance.sql.gz (900MB), which matches what Nominatim has used since at least 2019 (the first 8 columns have been used since 2013, I believe)
- wikipedia_article.csv.gz (300MB) with columns …
- wikipedia_redirect.csv.gz (550MB) with columns …

My suggestions are:

- Drop columns that are only needed for debugging (-debug). For example, langcount is used to create the importance.
- Strip the Q from the wikidata ids. That would allow the database to store the value as numeric (less disc usage); see the sketch after this list.

Any ideas to cut down the data size or restructure the files are welcome.
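A sketch of the Q-stripping idea as a plain SQL transformation, using the wd_page_title column mentioned above (the regex guard is my assumption, not part of the issue):

```sql
-- 'Q42' -> 42: a bigint takes 8 bytes and indexes more compactly
-- than the Q-prefixed string.
SELECT language, title, importance,
       substring(wd_page_title FROM 2)::bigint AS wikidata_num
  FROM wikipedia_article
 WHERE wd_page_title ~ '^Q[0-9]+$';   -- skip malformed ids
```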