
Can't import Wiki Data - Either becomes Idle without finishing or using resources, or throws a DEADLOCK IMMANENT error #695

Open
brett--anderson opened this issue Feb 9, 2023 · 6 comments

@brett--anderson

I'm trying to import the full Wikidata dump (possibly just the English attributes) into the KGTK format for further analysis. The process runs for a few hours. The terminal output shows that some of the processes have reached about 1.4 million lines processed (out of an unknown total). While it runs I watch the system resources and see several kgtk processes using most of the machine's memory between them. The number of kgtk processes drops over time; now there are only two, neither using even 1% of memory and with no CPU activity. The import seems to have effectively stopped, yet the terminal still shows the last lines-processed count, so the process is technically still running but has ceased to do anything. It has been in this state for at least an hour.
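
A quick way to check whether those workers are really idle (a sketch only; ps, pgrep and top are standard Linux tools, nothing KGTK-specific, and this assumes the workers show up under the name "kgtk" as they do in top):

# list the kgtk processes with their state, CPU/memory usage and elapsed time
ps -eo pid,stat,%cpu,%mem,etime,comm | grep kgtk

# or watch them continuously
top -p $(pgrep -d',' kgtk)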

To Reproduce
Installed KGTK under Python 3.9.15 in a local conda env
Downloaded the Wikidata dump (~70 GB compressed), from within the last 12 months
Activated the conda env
Ran the command:

kgtk  --debug --timing --progress import-wikidata \
        -i latest-all.json.bz2 \
        --node nodefile.tsv \
        --edge edgefile.tsv \
        --qual qualfile.tsv \
        --use-mgzip-for-input True \
        --use-mgzip-for-output True \
        --use-shm True \
        --procs 6 \
        --mapper-batch-size 5 \
        --max-size-per-mapper-queue 3 \
        --single-mapper-queue True \
        --collector-batch-size 10 \
        --collector-queue-per-proc-size 3 \
        --progress-interval 50000 --fail-if-missing False
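
A smaller smoke test of the same pipeline can be run first to confirm the conversion itself works before committing hours to the full dump. The sketch below is only an illustration: the sample size and file names are arbitrary, and it assumes import-wikidata tolerates a dump truncated mid-array (the dump keeps one entity per JSON line):

bzcat latest-all.json.bz2 | head -n 200000 | bzip2 > wikidata-sample.json.bz2

kgtk import-wikidata \
        -i wikidata-sample.json.bz2 \
        --node nodefile-sample.tsv \
        --edge edgefile-sample.tsv \
        --qual qualfile-sample.tsv \
        --procs 2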

Expected behavior
The process continues running and uses system resources in a way that indicates it is doing something, until all of the Wikidata has been converted to the TSV format, or a useful error is thrown.

  • OS: AWS EC2 t3.2xlarge instance: Ubuntu 22.04 LTS (32 GB memory, 10 GB swap, 2 TB disk, 8 vCPUs)
  • KGTK 1.5.0 (installed with conda in local venv)
  • Python 3.9.15

Additional context
Not sure if the problem is caused by a memory leak or a deadlock 🤷‍♂️
After manually killing the process (Ctrl+C), the following traceback is printed three times:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs) 
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/kgtk/cli/import_wikidata.py", line 1897, in run
    action, nrows, erows, qrows, invalid_erows, invalid_qrows, header = collector_q.get()
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/pyrallel/queue.py", line 809, in get
    src_pid, msg_id, block_id, total_chunks, next_chunk_block_id = self.next_readable_msg(block, remaining_timeout) # This call might raise Empty.
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/pyrallel/queue.py", line 602, in next_readable_msg
    block_id: typing.Optional[int] = self.get_first_msg(block=block, timeout=remaining_timeout)
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/pyrallel/queue.py", line 474, in get_first_msg
    self.msg_list_semaphore.acquire(block=block, timeout=timeout)
KeyboardInterrupt

Followed by this:
/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 76 leaked shared_memory objects to clean up at shutdown
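
The leaked segments can be cleaned up by hand before re-running (a sketch, assuming a Linux host where the segments live in /dev/shm and that pyrallel uses the default psm_ name prefix of Python's multiprocessing.shared_memory; check the names before deleting anything):

# inspect the leftover shared-memory segments
ls -lh /dev/shm

# remove only the Python shared_memory segments (psm_* by default)
rm /dev/shm/psm_*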

@brett--anderson
Author

I tried running again after upgrading KGTK to 1.5.2. I used exactly the same command, except this time I set --procs to 1.

I had several KGTK processes, one holding roughly 96% of the memory and occasionally spiking to 100%. Then, rather than going silent as before, it actually threw an error:

2850000 lines processed by processor 0
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/kgtk/cli/import_wikidata.py", line 2636, in run
    pp.add_task(line)
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/pyrallel/parallel_processor.py", line 383, in add_task
    self._add_task(self.batch_data)
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/pyrallel/parallel_processor.py", line 388, in _add_task
    self.mapper_queues[0].put((ParallelProcessor.CMD_DATA, batched_args))
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/pyrallel/queue.py", line 671, in put
    raise ValueError("DEADLOCK IMMANENT: qid=%d src_pid=%d: total_chunks=%d > maxsize=%d" % (self.qid, src_pid, total_chunks, self.maxsize))
ValueError: DEADLOCK IMMANENT: qid=3 src_pid=3146: total_chunks=4 > maxsize=3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/kgtk/exceptions.py", line 70, in __call__
    return_code = func(*args, **kwargs) or 0
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/kgtk/cli/import_wikidata.py", line 2732, in run
    raise KGTKException(str(e))
kgtk.exceptions.KGTKException: DEADLOCK IMMANENT: qid=3 src_pid=3146: total_chunks=4 > maxsize=3
DEADLOCK IMMANENT: qid=3 src_pid=3146: total_chunks=4 > maxsize=3
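
Reading the message, a single batched item needed total_chunks=4 shared-memory chunks while the queue's maxsize is 3, so that item could never fit no matter how long the consumer waits, and put() raises instead of blocking forever. If --max-size-per-mapper-queue is that maxsize and --mapper-batch-size drives the item size, then a smaller batch or a larger queue might avoid the error. This is an unverified guess, using only the flags already present in the original command:

kgtk --debug --timing --progress import-wikidata \
        -i latest-all.json.bz2 \
        --node nodefile.tsv \
        --edge edgefile.tsv \
        --qual qualfile.tsv \
        --mapper-batch-size 1 \
        --max-size-per-mapper-queue 10 \
        --procs 6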

@brett--anderson
Author

I tried again with two processes. Same result as in my first post. I did notice that the two processes spawned to do the work eventually became zombies, as shown by top:

   3263 ubuntu    20   0       0      0      0 Z   0.0   0.0  19:19.40 kgtk
   3264 ubuntu    20   0       0      0      0 Z   0.0   0.0  29:54.03 kgtk
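
To confirm they are zombies rather than merely idle, something like the following can be used (ps is standard, nothing KGTK-specific; a state of Z means the child has exited but its parent, the main kgtk process, has not reaped it yet):

# show state (Z = zombie) and parent PID of every process named kgtk
ps -o pid,ppid,stat,etime,comm -C kgtk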

@filievski
Contributor

Hi Brett, while I am trying to find who can help with this, we have made the imported data for the Wikidata 2022-11 dump available here:
https://kgtk.isi.edu/#/data
Perhaps that helps you proceed with your work for now?

@brett--anderson
Author

Hi! Thanks for that link, I'll try using the pre-processed version and that should get me unstuck for now. Thanks!

@tommasocarraro

Hello everyone!

I have a similar problem. After about one hour of execution, my Mac shut down unexpectedly. This happens if the maximum number of processors is used. If instead I keep this number at 6, I run into the same problem reported above.

My question is the following: I am looking at the Wikidata KGTK files provided by @filievski. Where can I find the node.tsv, edge.tsv, and qualifier.tsv files?

There are many files and it is a bit unintuitive.

@tommasocarraro

Coming back to this issue.

After several tries, I launched the import on a cluster at my institution, with 32 cores and 250 GB of RAM.

I launched the same command as in this issue, but with --procs set to 32.

At some point, the program stops producing output. The first time, I set a time limit of one day; the job got stuck before reaching that limit. On the second try, I set a time limit of one week. As in the first trial, the program got stuck, and after it had been stuck for about a day, the cluster node crashed, probably because it ran out of memory.

Could you kindly explain how one can successfully import this Wikidata dump?

Please also reply to my previous comment.
