New format for wikipedia ranking files #2767

Open
mtmail opened this issue Jul 14, 2022 · 6 comments

mtmail commented Jul 14, 2022

https://nominatim.org/release-docs/latest/admin/Import/#wikipediawikidata-rankings

The https://github.com/osm-search/wikipedia-wikidata project can be run regularly again. A run takes about 18 hours and 350 GB of disk space (of which 300 GB is the PostgreSQL database). There is potential to make it faster, consume less disk space and add more languages over time.

The current output is three files:

  1. wikipedia_importance.sql.gz (900MB), which matches what Nominatim has used since at least 2019 (the first 8 columns have been in use since 2013, I believe)
  2. wikipedia_article.csv.gz (300MB) with columns
    • language
    • title
    • langcount
    • othercount
    • totalcount
    • lat -- empty for 20% of rows
    • lon
    • importance -- 0 for 5% of rows
    • title_en -- always empty
    • osm_type -- always empty
    • osm_id -- always empty
    • wd_page_title -- actually contains wikidata ids, e.g. Q50826796, empty for <1% of rows
    • instance_of -- also a wikidata id
  3. wikipedia_redirect.csv.gz (550MB) with columns
    • language
    • from_title
    • to_title

My suggestions are:

  1. Nominatim should use the CSV files, because they allow users to cut the data down to the languages they need (see the sketch after this list).
  2. Remove all columns that are only used for debugging and move them into a separate output file (e.g. -debug). For example, langcount is only used while calculating the importance.
  3. Remove the Q from the wikidata ids. That would allow the database to store the value as a number (less disk usage).
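
A minimal sketch of how suggestions 1 and 3 could be applied once the CSV has been loaded into a staging table with the columns listed above (the table names, the wikidata_id column and the language list are only examples, not part of the current output):

-- Hypothetical trimming step: keep only the wanted languages and store the
-- wikidata id as a number without its 'Q' prefix (empty ids become NULL).
CREATE TABLE wikipedia_article_trimmed AS
  SELECT language, title, importance,
         nullif(substring(wd_page_title from 2), '')::bigint AS wikidata_id
    FROM wikipedia_article
   WHERE language IN ('en', 'de', 'fr');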

Any ideas to cut down the data size or restructure the files are welcome.

lonvia commented Jul 15, 2022

Nominatim currently uses the following columns of wikipedia_article: language, title, importance, wd_page_title. All other columns could be removed. This would cut down the table size by about 30%.

We should also think about uniting the two separate tables. If we cut down wikipedia_article to only 4 columns, the simplest way to get to a single table would be to add full rows for all redirections. Put together, my suggestion for the final table would be:

CREATE TABLE wikimedia_importance AS
  ((SELECT language, title, importance, wd_page_title
      FROM wikipedia_article
     WHERE importance != 0)
   UNION
   (SELECT r.language, r.from_title, a.importance, a.wd_page_title
      FROM wikipedia_article a, wikipedia_redirect r
     WHERE a.language = r.language
       AND a.title = r.to_title
       AND a.importance != 0))

Such a table would make the search for wikipedia matches significantly easier.
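
To illustrate (just a sketch; the tag values are made-up examples), looking up an OSM object tagged wikipedia=de:Berlin or wikidata=Q64 would then be a single query against one table:

-- Find the importance for either the wikipedia or the wikidata tag value;
-- pick the best match if several rows qualify.
SELECT importance
  FROM wikimedia_importance
 WHERE (language = 'de' AND title = 'Berlin')
    OR wd_page_title = 'Q64'
 ORDER BY importance DESC
 LIMIT 1;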

Also, once we have only one table, moving to CSV directly is possible. With multiple tables we would need multiple files, and I'm strongly against that.

I'm skeptical about converting wikidata IDs into an int. They are strings with the Q prefix in OSM, so the query would have to convert the value first before looking up the data. That is a bit of a pain in SQL. As it is now, we can do a simple string comparison. The conversion saves about 10% of space for both the table and the index.
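
For comparison, a sketch of the two lookup styles (assuming the table keeps its wd_page_title column and 'Q64' stands in for the raw value of an OSM wikidata tag):

-- String column: the OSM tag value can be compared directly.
SELECT importance FROM wikimedia_importance WHERE wd_page_title = 'Q64';

-- Numeric column: every lookup would first have to strip the prefix and cast.
SELECT importance FROM wikimedia_importance
 WHERE wd_page_title = substring('Q64' from 2)::bigint;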

lonvia commented Sep 14, 2022

There is now a preliminary version of the importer for the new format at https://github.com/lonvia/Nominatim/tree/new-wikimedia-tables if you want to try the files out. The table is about half the size and the indexes are slightly smaller. The lookup code has become a lot simpler (although that's in large part because I have dropped the ability to parse wikipedia URLs; that is luckily a thing of the past).

mtmail commented Apr 9, 2024

The wikimedia_importance.csv.gz file is regularly created at https://downloads.opencagedata.com/public/wikimedia_importance/

lonvia commented May 5, 2024

Two small things I noticed with the new files:

  1. They should get a header line, so that we can change the format in the future.
  2. We should define the expected delimiters and quote characters. delimiter='\t', quotechar='|' is what worked for me in the end. The documentation on https://github.com/osm-search/wikipedia-wikidata uses a different delimiter.
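
For reference, a load command matching these settings might look like this (only a sketch: the column layout follows the four columns discussed above, HEADER assumes the header line from point 1, the path is a placeholder, and COPY ... FROM PROGRAM needs superuser rights):

COPY wikimedia_importance (language, title, importance, wd_page_title)
  FROM PROGRAM 'gzip -dc /path/to/wikimedia_importance.csv.gz'
  WITH (FORMAT csv, DELIMITER E'\t', QUOTE '|', HEADER true);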

mtmail commented May 5, 2024

The CSV file now contains a header row (osm-search/wikipedia-wikidata@f798ce5). The May 2024 files will be released sometime this week.
If you prefer, we can also change the output to be comma-delimited, "-quoted CSV.

lonvia commented May 6, 2024

> If you prefer, we can also change the output to be comma-delimited, "-quoted CSV.

The current output is fine. I was thinking more that we need to document it properly, both in https://github.com/osm-search/wikipedia-wikidata and when describing the format in the Nominatim docs in the future.
