Unable to run example_completion.py on CodeLlama-7b #203

Open
aditya4d1 opened this issue Feb 17, 2024 · 4 comments

Hi,
I have a single GPU on my system and I am using CodeLlama-7b to test my environment.
I am running into the following error when I run the sample:

$ torchrun --nproc_per_node 1 example_completion.py    \
 --ckpt_dir CodeLlama-7b    \
 --tokenizer_path CodeLlama-7b/tokenizer.model    \
 --max_seq_len 128 --max_batch_size 1
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "/home/aditya/rb16/Code/llama-ft/codellama/example_completion.py", line 53, in <module>
    fire.Fire(main)
  File "/home/aditya/anaconda3/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aditya/anaconda3/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/aditya/anaconda3/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aditya/rb16/Code/llama-ft/codellama/example_completion.py", line 20, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "/home/aditya/rb16/Code/llama-ft/codellama/llama/generation.py", line 102, in build
    checkpoint = torch.load(ckpt_path, map_location="cpu")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1026, in load
    return _load(opened_zipfile,
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1438, in _load
    result = unpickler.load()
             ^^^^^^^^^^^^^^^^
  File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1408, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1373, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)._typed_storage()._untyped_storage
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 5] Input/output error
[2024-02-17 13:26:43,422] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3852309) of binary: /home/aditya/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/aditya/anaconda3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-17_13:26:43
  host      : stormbreaker
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3852309)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

$ ls -ltr ./CodeLlama-7b
total 13169098
-rw-rw-r-- 1 aditya aditya      500058 Aug 21 14:32 tokenizer.model
-rw-rw-r-- 1 aditya aditya         163 Aug 21 14:32 params.json
-rw-rw-r-- 1 aditya aditya 13477187307 Aug 21 14:32 consolidated.00.pth
-rw-rw-r-- 1 aditya aditya         150 Aug 21 14:32 checklist.chk
$ echo $CUDA_VISIBLE_DEVICES
0

The conda environment:

channels:
  - pytorch
  - nvidia
dependencies:
  - numpy
  - pandas
  - pytorch-cuda=12.1
  - pytorch
  - torchvision
  - torchaudio
variables:
  CUDA_PATH: /usr/local/cuda-12.1
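
To take the launcher out of the picture, the same CPU-side load can be attempted without torchrun. A minimal sketch, with the checkpoint path assumed from the listing above:

# Sketch: repeat the torch.load call from generation.py outside torchrun, to see
# whether the I/O error is tied to the distributed launch or to any read of the file.
# The path below is assumed from the directory listing above.
import torch

ckpt_path = "CodeLlama-7b/consolidated.00.pth"
checkpoint = torch.load(ckpt_path, map_location="cpu")
print(f"loaded checkpoint with {len(checkpoint)} entries")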

jgehring commented Feb 20, 2024

Hi @aditya4d1, to rule out corrupted files (which the error message seems to point to), can you run md5sum -c checklist.chk in the CodeLlama-7b directory?
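
If md5sum itself errors out, a chunked read in Python should surface the same failure; a rough sketch, with the path assumed from your listing:

# Sketch: compute the checkpoint's MD5 in 1 MiB chunks (path assumed from the
# listing above). An OSError raised here points at the file/disk itself rather
# than at torch or the launcher.
import hashlib

path = "CodeLlama-7b/consolidated.00.pth"
h = hashlib.md5()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print(h.hexdigest())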


aditya4d1 commented Feb 20, 2024

@jgehring

md5sum: consolidated.00.pth: Input/output error
consolidated.00.pth: FAILED open or read
params.json: OK
tokenizer.model: OK
md5sum: WARNING: 1 listed file could not be read

Should I re-download the weights?

Update:
Re-downloaded the weights and ran into a checksum error again:

Checking checksums
consolidated.00.pth: FAILED
params.json: OK
tokenizer.model: OK
md5sum: WARNING: 1 computed checksum did NOT match
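
For reference, a sequential chunked read like the rough sketch below (path assumed as above) would show whether the failure always hits the same byte offset, which would point at a bad sector or filesystem problem rather than at the download itself:

# Sketch: read the checkpoint sequentially and report the offset where the read
# fails, if it does. Path assumed from the earlier directory listing.
path = "CodeLlama-7b/consolidated.00.pth"
offset = 0
try:
    with open(path, "rb") as f:
        while True:
            chunk = f.read(1 << 20)
            if not chunk:
                break
            offset += len(chunk)
    print(f"read {offset} bytes without error")
except OSError as exc:
    print(f"read failed at ~{offset} bytes: {exc}")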

aditya4d1 commented:

ping @jgehring
