Not releasing /tmp #682

Open
seyedahbr opened this issue Nov 14, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@seyedahbr

Describe the bug
I ran the following command on one of the 2022 Wikidata JSON dumps, and I did not have enough free disk space for the output *.tsv files:

kgtk --debug --timing --progress import-wikidata -i wikidata-20220103-all.json.gz --node nodefile.tsv --edge edgefile.tsv --qual qualfile.tsv --use-mgzip-for-input True --use-mgzip-for-output True --use-shm True --procs 6 --mapper-batch-size 5 --max-size-per-mapper-queue 3 --single-mapper-queue True --collect-results True --collect-seperately True --collector-batch-size 10 --collector-queue-per-proc-size 3 --progress-interval 500000 --fail-if-missing False

The process failed, and /tmp was completely filled by kgtk-graph-cache-sh200.sqlite3.db (about 63 GB). The SQLite file also seems to remain after other, successful imports.

To Reproduce
I am not sure how to reproduce the situation, but I think the problem was caused by a lack of free space.

Expected behavior
/tmp should be cleared after the import process finishes, whether it succeeds or fails.
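
Until cleanup is automatic, leftover cache files can be located by hand. The following is a minimal sketch, not part of KGTK: the helper name list_graph_caches is hypothetical, and the name pattern is an assumption based on the single file name reported above.

```shell
# Hypothetical helper: list leftover graph cache files directly under a directory.
# The name pattern is an assumption based on the one file name reported in this issue.
list_graph_caches() {
    find "${1:-/tmp}" -maxdepth 1 -type f -name 'kgtk-graph-cache-*.sqlite3.db'
}

# Show the size of each leftover cache file in /tmp.
list_graph_caches /tmp | while IFS= read -r f; do
    du -h "$f"
done
```

Any file it reports can then be removed manually once no query is using it. That the cache can simply be rebuilt from the input files is an assumption here; check the graph cache documentation before deleting.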

@CraigMiloRogers CraigMiloRogers added the bug Something isn't working label Nov 14, 2022
@CraigMiloRogers CraigMiloRogers added this to To do in KGTK Development via automation Nov 14, 2022
@CraigMiloRogers
Collaborator

CraigMiloRogers commented Nov 14, 2022

First, I suggest compressing the output files as they are created by using the extension ".tsv.gz".
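
As a rough sketch of that suggestion, the import from the report can be pointed at ".tsv.gz" output names. The wrapper function import_compressed is hypothetical and only there to make the command easy to reuse; the input file, output base names, and options are the ones from the original command, and the remaining options from that command are omitted for brevity and can be kept unchanged.

```shell
# Hypothetical wrapper around the import from the report, writing compressed outputs.
# Only the output extensions differ from the original command (.tsv -> .tsv.gz).
import_compressed() {
    kgtk import-wikidata -i wikidata-20220103-all.json.gz \
        --node nodefile.tsv.gz \
        --edge edgefile.tsv.gz \
        --qual qualfile.tsv.gz
}
```

Calling import_compressed then runs the import with the compressed output names.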

@CraigMiloRogers
Collaborator

Second, I don't see how the graph cache file would be created by the kgtk import-wikidata command. Perhaps some other commands were run as well?

@seyedahbr
Author

Ahh, I have another command that runs after the import. That command is:

kgtk query -i edgefile.tsv --match '(n1)-[:P31]->(class), (n1)-[p]->(n2)' --where 'class IN ["Q11173","Q12136","Q7187","Q8054"]' --return 'n1, p, n2' > ./kgtk_output.tsv

@chalypso
Collaborator

Working with large data files requires some care. As Craig suggested, make sure the edge file that was produced is compressed so that it does not waste space.
For Kypher, use the --gc option to point it at a graph cache file in a location that has enough available space. For example:
kgtk query --gc /data/wikidata.sqlite3.db ....
See https://github.com/usc-isi-i2/kgtk/blob/dev/docs/transform/query.md#graph-cache
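
Before choosing a cache location, it can help to confirm that the target filesystem really has room. A minimal sketch, assuming a POSIX df: the helper names free_kib and has_free_gib and the 100 GiB threshold are hypothetical, and only the --gc option and the /data/wikidata.sqlite3.db path come from the example above.

```shell
# Hypothetical helper: print the free space (in KiB) of the filesystem holding a directory.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed only if the directory has at least the given number of GiB free.
has_free_gib() {
    [ "$(free_kib "$1")" -ge $(( $2 * 1024 * 1024 )) ]
}

# Example use (the 100 GiB threshold is an assumption, pick one that fits your data):
# has_free_gib /data 100 && kgtk query --gc /data/wikidata.sqlite3.db ....
```

The check is only a guard against starting a query that is certain to fill the disk; it does not predict how large the cache will grow.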


3 participants