New format for wikipedia ranking files #2767

Open
mtmail opened this issue Jul 14, 2022 · 6 comments

mtmail commented Jul 14, 2022

https://nominatim.org/release-docs/latest/admin/Import/#wikipediawikidata-rankings

The https://github.com/osm-search/wikipedia-wikidata project can be run regularly again. A run takes about 18 hours and 350 GB of disk space (of which 300 GB is the PostgreSQL database). There is potential to make it faster, consume less disk space and add more languages over time.

The current output is three files:

  1. wikipedia_importance.sql.gz (900MB), which matches what Nominatim has used since at least 2019 (the first 8 columns have been in use since 2013, I believe)
  2. wikipedia_article.csv.gz (300MB) with columns
    • language
    • title
    • langcount
    • othercount
    • totalcount
    • lat -- empty for 20% of rows
    • lon
    • importance -- 0 for 5% of rows
    • title_en -- always empty
    • osm_type -- always empty
    • osm_id -- always empty
    • wd_page_title -- actually contains wikidata ids, e.g. Q50826796, empty for <1% of rows
    • instance_of -- also a wikidata id
  3. wikipedia_redirect.csv.gz (550MB) with columns
    • language
    • from_title
    • to_title

My suggestions are:

  1. Nominatim should use the CSV files, because they allow users to cut the data down to the languages they need (see the sketch after this list).
  2. Remove all columns that are only used for debugging and move them into a separate output file (e.g. -debug). For example, langcount is only used while calculating the importance.
  3. Remove the Q from the wikidata ids. That would allow the database to store the value as a number (less disk usage).
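
A minimal sketch of how suggestions 1 and 3 could be applied once the CSV has been loaded into a staging table with the columns listed above (the table names, the wikidata_id column and the language list are only examples, not part of the current output):

-- Hypothetical trimming step: keep only the wanted languages and store the
-- wikidata id as a number without its 'Q' prefix (empty ids become NULL).
CREATE TABLE wikipedia_article_trimmed AS
  SELECT language, title, importance,
         nullif(substring(wd_page_title from 2), '')::bigint AS wikidata_id
    FROM wikipedia_article
   WHERE language IN ('en', 'de', 'fr');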

Any ideas to cut down the data size or restructure the files are welcome.

lonvia commented Jul 15, 2022

Nominatim currently uses the following columns of wikipedia_article: language, title, importance, wd_page_title. All other columns could be removed. This would cut down the table size by about 30%.

We should also think about uniting the two separate tables. If we cut down wikipedia_article to only 4 columns, the simplest way to get to a single table would be to add full rows for all redirections. Put together, my suggestion for the final table would be:

CREATE TABLE wikimedia_importance AS
  ((SELECT language, title, importance, wd_page_title
      FROM wikipedia_article
     WHERE importance != 0)
   UNION
   (SELECT r.language, r.from_title, a.importance, a.wd_page_title
      FROM wikipedia_article a, wikipedia_redirect r
     WHERE a.language = r.language
       AND a.title = r.to_title
       AND a.importance != 0))

Such a table would make the search for wikipedia matches significantly easier.
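
To illustrate (just a sketch; the tag values are made-up examples), looking up an OSM object tagged wikipedia=de:Berlin or wikidata=Q64 would then be a single query against one table:

-- Find the importance for either the wikipedia or the wikidata tag value;
-- pick the best match if several rows qualify.
SELECT importance
  FROM wikimedia_importance
 WHERE (language = 'de' AND title = 'Berlin')
    OR wd_page_title = 'Q64'
 ORDER BY importance DESC
 LIMIT 1;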

Also, once we have only one table, moving to CSV directly is possible. With multiple tables we would need multiple files, and I'm strongly against that.

I'm skeptical about converting wikidata IDs into an int. They are strings with the Q prefix in OSM, so the query would have to convert the value first before looking up the data. That is a bit of a pain in SQL. As it is now, we can do a simple string comparison. The conversion saves about 10% of space for both the table and the index.
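
For comparison, a sketch of the two lookup styles (assuming the table keeps its wd_page_title column and 'Q64' stands in for the raw value of an OSM wikidata tag):

-- String column: the OSM tag value can be compared directly.
SELECT importance FROM wikimedia_importance WHERE wd_page_title = 'Q64';

-- Numeric column: every lookup would first have to strip the prefix and cast.
SELECT importance FROM wikimedia_importance
 WHERE wd_page_title = substring('Q64' from 2)::bigint;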

lonvia commented Sep 14, 2022

There is now a preliminary version of the importer for the new format at https://github.com/lonvia/Nominatim/tree/new-wikimedia-tables if you want to try the files out. The table is about half the size and the indexes are slightly smaller. The lookup code has become a lot simpler (although that's in large part because I have dropped the ability to parse wikipedia URLs; that is luckily a thing of the past).

mtmail commented Apr 9, 2024

The wikimedia_importance.csv.gz file is regularly created at https://downloads.opencagedata.com/public/wikimedia_importance/

lonvia commented May 5, 2024

Two small things I noticed with the new files:

  1. They should get a header line, so that we can change the format in the future.
  2. We should define the expected delimiters and quote characters. delimiter='\t', quotechar='|' is what worked for me in the end. The documentation on https://github.com/osm-search/wikipedia-wikidata uses a different delimiter.
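
For reference, a load command matching these settings might look like this (only a sketch: the column layout follows the four columns discussed above, HEADER assumes the header line from point 1, the path is a placeholder, and COPY ... FROM PROGRAM needs superuser rights):

COPY wikimedia_importance (language, title, importance, wd_page_title)
  FROM PROGRAM 'gzip -dc /path/to/wikimedia_importance.csv.gz'
  WITH (FORMAT csv, DELIMITER E'\t', QUOTE '|', HEADER true);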

mtmail commented May 5, 2024

The CSV file now contains a header row (osm-search/wikipedia-wikidata@f798ce5). The May 2024 files will be released sometime this week.
If you prefer, we can also change the output to be comma-delimited, "-quoted CSV.

lonvia commented May 6, 2024

> If you prefer, we can also change the output to be comma-delimited, "-quoted CSV.

The current output is fine. I was thinking more that we need to document it properly, both in https://github.com/osm-search/wikipedia-wikidata and when describing the format in the Nominatim docs in the future.
