normalize-nodes --id-column is broken #699

joelb-git · 2023-03-13T16:44:46Z

$ mkdir tmp
$ cd tmp
$ wget https://github.com/usc-isi-i2/kgtk-notebooks/raw/main/datasets/imdb/IMDB.csv.gz

$ zcat IMDB.csv.gz | head -2
imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,None,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey Depew",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0

First issue:

normalize-nodes does not honor the --id-column option, even though the documentation lists it:

https://kgtk.readthedocs.io/en/latest/transform/normalize_nodes/

  --id-column ID_COLUMN_NAME
                        The name of the ID column. (default=id or alias)

$ kgtk normalize-nodes --id-column imdb_title_id -i IMDB.csv.gz -o out.tsv
In input header 'imdb_title_id  title   original_title  year    date_published  genre   duration        country language        director        writer  production_company      actors  description     avg_vote        votes   budget  usa_gross_income        worlwide_gross_income   metascore       reviews_from_users      reviews_from_critics': Missing required column: id | ID
/Users/joelb/views/kgtk/kgtk/exceptions.py:90: UserWarning: Please raise KGTKException instead of <class 'SystemExit'>
  warnings.warn('Please raise KGTKException instead of {}'.format(type_))
KGTKException found

Second issue:

After manually renaming the column imdb_title_id to id, normalize-nodes
still fails:

$ zcat IMDB.csv.gz | perl -pe 's/^imdb_title_id/id/' >IMDB.id.csv
$ kgtk normalize-nodes -i IMDB.id.csv -o out.tsv
In input header 'id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics':
Warning: Column name 'id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics' contains a comma (,)
In input header 'id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics': Missing required column: id | ID
/Users/joelb/views/kgtk/kgtk/exceptions.py:90: UserWarning: Please raise KGTKException instead of <class 'SystemExit'>
  warnings.warn('Please raise KGTKException instead of {}'.format(type_))
KGTKException found

Running with -v shows that the input format is misdetected because of the
extra filename component (the ".id" in IMDB.id.csv).

$ kgtk normalize-nodes -v -i IMDB.id.csv -o out.tsv
Starting normalize_nodes pid=74318
Opening the input file: IMDB.id.csv
input format: kgtk  <---  should be csv

The code is trying to detect compression suffixes, as in foo.csv.gz.
For a name with two dots such as IMDB.id.csv, it sees that there is no gz
suffix but then mistakenly defaults to kgtk instead of csv.

Workarounds: rename the file so that it has a single dot (for example
foo.csv), or pass --input-format csv.
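
For reference, here is a minimal sketch of the behavior I would expect from the suffix detection. It is not the KGTK implementation: the function name guess_input_format, the suffix tables, and the mapping of .tsv to kgtk are all assumptions made for illustration. The idea is to drop a trailing compression suffix if there is one, then decide from the last remaining suffix, so that extra components like ".id" do not matter.

```python
from pathlib import PurePath

# Hypothetical tables for illustration only; not taken from the KGTK source.
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz"}
FORMAT_BY_SUFFIX = {".csv": "csv", ".tsv": "kgtk"}  # assumed mapping


def guess_input_format(filename: str, default: str = "kgtk") -> str:
    """Guess the input format from the file name.

    A trailing compression suffix is ignored, and only the last remaining
    suffix is consulted, so IMDB.id.csv and IMDB.csv.gz are both csv.
    """
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    # Drop a trailing compression suffix such as .gz, if present.
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return default
    # Unknown suffixes fall back to the default format.
    return FORMAT_BY_SUFFIX.get(suffixes[-1], default)
```

With this logic, IMDB.id.csv, IMDB.csv.gz, and foo.csv would all be treated as csv, and only names without a recognized suffix would fall back to the default.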
