normalize-nodes --id-column is broken #699

Open
joelb-git opened this issue Mar 13, 2023 · 0 comments

$ mkdir tmp
$ cd tmp
$ wget https://github.com/usc-isi-i2/kgtk-notebooks/raw/main/datasets/imdb/IMDB.csv.gz

$ zcat IMDB.csv.gz | head -2
imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,None,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey Depew",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0

First issue:

normalize-nodes does not honor the --id-column option, even though it is documented:

https://kgtk.readthedocs.io/en/latest/transform/normalize_nodes/

  --id-column ID_COLUMN_NAME
                        The name of the ID column. (default=id or alias)

$ kgtk normalize-nodes --id-column imdb_title_id -i IMDB.csv.gz -o out.tsv
In input header 'imdb_title_id  title   original_title  year    date_published  genre   duration        country language        director        writer  production_company      actors  description     avg_vote        votes   budget  usa_gross_income        worlwide_gross_income   metascore       reviews_from_users      reviews_from_critics': Missing required column: id | ID
/Users/joelb/views/kgtk/kgtk/exceptions.py:90: UserWarning: Please raise KGTKException instead of <class 'SystemExit'>
  warnings.warn('Please raise KGTKException instead of {}'.format(type_))
KGTKException found
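
For reference, here is a minimal sketch of the behavior I would expect from --id-column, based only on the documentation excerpt and the error message above. This is not KGTK's code: the function name resolve_id_column and the alias tuple are hypothetical, chosen for illustration.

```python
from typing import Optional, Sequence

# Aliases taken from the error message "Missing required column: id | ID".
DEFAULT_ID_ALIASES = ("id", "ID")


def resolve_id_column(header: Sequence[str], id_column: Optional[str] = None) -> int:
    """Return the index of the ID column in the header.

    Hypothetical helper: an explicit id_column overrides the default aliases.
    """
    if id_column is not None:
        if id_column not in header:
            raise ValueError(f"Requested ID column {id_column!r} is not in the header")
        return list(header).index(id_column)
    # No override given: fall back to the default aliases.
    for alias in DEFAULT_ID_ALIASES:
        if alias in header:
            return list(header).index(alias)
    raise ValueError("Missing required column: id | ID")


if __name__ == "__main__":
    header = ["imdb_title_id", "title", "original_title", "year"]
    # With the override, the first column is accepted as the ID column.
    print(resolve_id_column(header, id_column="imdb_title_id"))
```

With the override applied before the alias check, the header above would be accepted instead of failing with "Missing required column".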

Second issue:

After manually renaming the column imdb_title_id to id (and writing
the result uncompressed), there is still an error:

$ zcat IMDB.csv.gz | perl -pe 's/^imdb_title_id/id/' >IMDB.id.csv
$ kgtk normalize-nodes -i IMDB.id.csv -o out.tsv
In input header 'id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics':
Warning: Column name 'id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics' contains a comma (,)
In input header 'id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics': Missing required column: id | ID
/Users/joelb/views/kgtk/kgtk/exceptions.py:90: UserWarning: Please raise KGTKException instead of <class 'SystemExit'>
  warnings.warn('Please raise KGTKException instead of {}'.format(type_))
KGTKException found

Running with -v shows that the cause is the extra filename component (the
.id in IMDB.id.csv): the input format is detected as kgtk instead of csv.

$ kgtk normalize-nodes -v -i IMDB.id.csv -o out.tsv
Starting normalize_nodes pid=74318
Opening the input file: IMDB.id.csv
input format: kgtk  <---  should be csv

The code appears to be looking for a compression suffix, as in foo.csv.gz.
For IMDB.id.csv it finds no gz suffix, but instead of then treating .csv as
the format it mistakenly falls back to the default, kgtk.

Workarounds: rename the file so that it contains a single dot (e.g. foo.csv),
or pass --input-format csv.
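
To make the suspected failure mode concrete, here is a small Python sketch. It is a guess at the shape of the logic, not KGTK's actual implementation: the function names, the suffix tables, and the kgtk default are assumptions made for illustration. The first function reproduces the observed behavior; the second strips a compression suffix only when one is present and then looks at the last remaining suffix.

```python
from pathlib import Path

# Hypothetical tables, for illustration only.
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz", ".lz4"}
FORMAT_BY_SUFFIX = {".csv": "csv", ".tsv": "kgtk"}
DEFAULT_FORMAT = "kgtk"


def guess_format_suspected(filename: str) -> str:
    """Mimic the observed behavior: with two or more suffixes, assume the
    last one is a compression suffix and read the format from the one before."""
    suffixes = Path(filename).suffixes
    if not suffixes:
        return DEFAULT_FORMAT
    fmt_suffix = suffixes[-2] if len(suffixes) >= 2 else suffixes[-1]
    return FORMAT_BY_SUFFIX.get(fmt_suffix.lower(), DEFAULT_FORMAT)


def guess_format_fixed(filename: str) -> str:
    """Drop the last suffix only if it really is a compression suffix,
    then read the format from the last remaining suffix."""
    suffixes = Path(filename).suffixes
    if suffixes and suffixes[-1].lower() in COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]
    if not suffixes:
        return DEFAULT_FORMAT
    return FORMAT_BY_SUFFIX.get(suffixes[-1].lower(), DEFAULT_FORMAT)


if __name__ == "__main__":
    for name in ("IMDB.csv.gz", "IMDB.csv", "IMDB.id.csv", "IMDB.id.csv.gz"):
        print(name, guess_format_suspected(name), guess_format_fixed(name))
```

Under these assumptions, both functions agree on IMDB.csv.gz and IMDB.csv, and only the second one reports csv for IMDB.id.csv.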
