AssertionError(assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype))) #21

Open
jinserk opened this issue Aug 28, 2020 · 13 comments
Labels
bug Something isn't working question Further information is requested

Comments

@jinserk

jinserk commented Aug 28, 2020

Can I ask what this error means?

Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 202, in run
    self.setup()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 190, in setup
    self.set_dataloaders()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 134, in set_dataloaders
    trainset, valset = self.set_datasets()
  File "/home/jinserk/kyu/kyumlm/tddft/ann/workers.py", line 65, in set_datasets
    dataset = MatorageAnnDataset(trainset_config, clear=True)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/torch/dataset.py", line 73, in __init__
    super(Dataset, self).__init__(config, **kwargs)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 80, in __init__
    self._init_download()
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 189, in _init_download
    assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype))
AssertionError
@graykode graykode added the question Further information is requested label Aug 28, 2020
@graykode
Owner

graykode commented Aug 28, 2020

Of course. Could you share all the files related to the metadata? (That is, the files inside the metadata folder.)

@jinserk
Author

jinserk commented Aug 28, 2020

Here is the only file in the metadata dir. The dataset name and host/port info have been redacted for security. Thank you!
6bd037556e8842d6.zip

@jinserk
Author

jinserk commented Aug 28, 2020

If I comment out the assertion, it does manage to retrieve data from the minio server. However, I get lots of noisy log messages like these:

2020/08/28 17:16:43 EDT [INFO] mlmanager.torch.workers (workers.py:56) set device cpu as rank 0                                                                                                                                                                                 
08/28/2020 17:16:43 - INFO - matorage.utils - PID: 1074302 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:43 - INFO - matorage.utils - PID: 1074302 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074424 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074424 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074441 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074441 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074487 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074487 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074506 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074506 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
2020/08/28 17:17:06 EDT [INFO] mlmanager.torch.workers (workers.py:316) train:  epoch 0001  lr 5.0000e-04  loss 0.191976                                                                                                                                                        
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078057 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078057 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078055 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078055 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078056 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078056 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078054 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078054 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
2020/08/28 17:17:08 EDT [INFO] mlmanager.torch.workers (workers.py:350) validate:  epoch 0001  loss 0.099826                                                                                                                                                                    
2020/08/28 17:17:08 EDT [INFO] mlmanager.torch.workers (workers.py:237) epoch 0001  ave_train_loss 0.191976  ave_val_loss 0.099826             

Can I turn them off? Sorry for lots of questions and bug reports.
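
One way to quiet these messages, as a minimal sketch: it assumes they come from a standard Python `logging` logger named `matorage.utils`, which is the name shown in the log lines above. `quiet_logger` is a hypothetical helper, not part of matorage.

```python
import logging


def quiet_logger(name: str, level: int = logging.WARNING) -> logging.Logger:
    """Raise the threshold of one named logger so lower-level records are dropped."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


# "matorage.utils" is the logger name that appears in the log output above.
# Assumption: call this after importing matorage, in case the library
# configures its own logger level at import time.
quiet_logger("matorage.utils")
```

Since the messages are printed under several PIDs, the call may also need to run inside each worker process, for example from a DataLoader `worker_init_fn`.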

@graykode
Owner

graykode commented Aug 29, 2020

@jinserk

Thank you for the detailed bug report!

While analyzing the bug you reported, I found a few more bugs related to the NAS.
The first is that the sub-JSON files of the metadata were not being read properly. I fixed this by modifying the NAS list_objects function:

    def list_objects(self, bucket_name, prefix="", recursive=False):
        _foldername = os.path.join(self.path, bucket_name, prefix)
        if not recursive:
            objects = [
                os.path.join(prefix, f) for f in os.listdir(_foldername)
            ]
        else:
            objects = [
                os.path.join(dp, f) for dp, dn, fn in os.walk(_foldername) for f in fn
            ]
        return [Obj(o) for o in objects if o.startswith(prefix)]

The second is related to assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype)).
I have confirmed that this error is caused by a mismatch between the metadata on the remote server and the locally cached metadata.

This cache maps the minio object keys to the locations of the downloaded files when the dataset is loaded. If you use the NAS setting, you don't actually need this cache.

Solution

  • First, delete the matorage cache (rm ~/.matorage/*.json).
  • Second, apply the hotfix code.
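
The first step can also be done from Python. This is a minimal sketch, assuming the cache lives in `~/.matorage` as `*.json` files, as the `rm` command above implies. `clear_matorage_cache` is a hypothetical helper, not a matorage API.

```python
from pathlib import Path


def clear_matorage_cache(cache_dir: Path = Path.home() / ".matorage") -> list:
    """Delete cached *.json metadata files and return the paths that were removed."""
    removed = []
    if not cache_dir.is_dir():
        # Nothing to clean if the cache folder does not exist yet.
        return removed
    for json_file in sorted(cache_dir.glob("*.json")):
        json_file.unlink()
        removed.append(json_file)
    return removed
```

Calling `clear_matorage_cache()` with no arguments targets the default folder. Other files in that folder are left untouched.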

One thing I'd like to ask: did you use an IPv4 address when using the NAS settings?

@jinserk
Author

jinserk commented Aug 30, 2020

Yes, I used an IPv4 address. I'll check your solution ASAP. Thank you so much for the prompt solution!

@graykode
Owner

@jinserk
When using a NAS, you must use a local path rather than an IPv4 address.

For example:

from matorage import DataConfig

# NAS example
data_config = DataConfig(
    endpoint='/tmp/shared',
    dataset_name='mnist',
    additional={
        "framework" : "pytorch",
        "mode" : "training"
    },
    compressor={
        "complevel" : 0,
        "complib" : "zlib"
    },
    attributes=[
        ('image', 'float32', (28, 28)),
        ('target', 'int64', (1, ))
    ]
)

If you use an IPv4 address for the endpoint, the connection is established over HTTP.
If you use a local path for the endpoint instead, it is much faster because it doesn't go through HTTP (files are simply copied from folder to folder).
Also, if you use an HTTP endpoint in the dataloader, the data is downloaded to every node unconditionally.
This code might be helpful: https://github.com/graykode/matorage/blob/master/matorage/data/data.py#L178
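
To make the distinction concrete, here is a minimal sketch of how an endpoint string could be classified. `endpoint_kind` is a hypothetical helper written only for illustration. It is not matorage's own check, and it assumes POSIX-style paths.

```python
def endpoint_kind(endpoint: str) -> str:
    """Classify an endpoint string as a local "path" or a "host:port" address.

    Hypothetical helper for illustration; this is not matorage's own check.
    """
    # A POSIX-style absolute path (e.g. a mounted NAS folder) means plain file copies.
    if endpoint.startswith("/"):
        return "path"
    # Anything else, such as "127.0.0.1:9000", is treated as an HTTP endpoint.
    return "host:port"
```

With this helper, `'/tmp/shared'` from the example above is classified as a path, while an address such as `127.0.0.1:9000` is classified as host:port.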

@jinserk
Author

jinserk commented Aug 31, 2020

Hi @graykode,
Thanks for the suggestion. I didn't know such a 'local path' addressing method existed. I have changed the addressing, and it currently seems to work well with DataSaver. I'll check it with Dataset once all the data has been uploaded.

By the way, I have two questions related to this:

  • When using local path addressing, does it go through the minio docker server or access the local path directly? I found that the old files and dirs in the path were owned by root, since the minio server runs with root permission. However, when I use local path addressing, newly created files and dirs are owned by my own user, which could be a problem when I share the newly uploaded dataset with other users on the same server. Am I correct?
  • If local path addressing accesses the files and dirs directly, can they also be explored or updated through a separate IPv4 connection? I mean, if I have a multi-node configuration for training a huge model, and I want to use a directory on only one root node (namely the rank0 node) as the matorage storage, can I set the rank0 node to local_path addressing while the other rank nodes use ipv4 addressing at the same time?

@jinserk
Author

jinserk commented Aug 31, 2020

It looks like I cannot use local_path addressing and ipv4 addressing at the same time:

Process TrainProcess-1:
Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 202, in run
    self.setup()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 190, in setup
    self.set_dataloaders()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 134, in set_dataloaders
    trainset, valset = self.set_datasets()
  File "/home/jinserk/kyu/kyumlm/tddft/ann/workers.py", line 64, in set_datasets
    trainset_config = nas.DataConfig.from_json_file("train.json")
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 312, in from_json_file
    return cls(**config_dict)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 131, in __init__
    self._check_all()
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 140, in _check_all
    self._check_bucket()
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 242, in _check_bucket
    raise ValueError(
ValueError: Already created endpoint(/mnt/hdd1/kyu/matorage) doesn't current endpoint str(127.0.0.1:9000) It may occurs permission denied error

@graykode
Owner

@jinserk

  • First question: On macOS I did not encounter such a permission error, but on Linux I did find one. Thank you for the troubleshooting. This seems to be a minio-related error, so I'll look for a solution.
  • Second question: As far as I know, this is possible. In other words, nodes that are physically attached to the NAS can access it through the local path, and other nodes can access it through HTTP.

@graykode
Owner

@jinserk

I found a solution for the first question.
Run the minio binary directly instead of using minio docker: https://github.com/minio/minio#gnulinux

wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
# run minio in the background
nohup ./minio gateway nas /home/nlkey2022/shared &

I don't know why we get a permission error with minio docker nas. I will open an issue with minio.

@jinserk
Author

jinserk commented Aug 31, 2020

@graykode
I guess this is because minio runs as root under docker but with user permissions when run as a local binary. If you run minio with root permission, I'd expect the same result:

sudo -H nohup ./minio gateway nas /home/nlkey2022/shared &

In my quick and humble opinion, we need to look at minio's set_bucket_policy to make the files or dirs public. Please check here, even though it's minio-java rather than minio-py. Of course I could be wrong, and I don't want to mislead you.

@graykode
Owner

graykode commented Sep 1, 2020

@jinserk

I don't actually know minio's configuration in detail, so I will look into it. Thank you.

I'll leave a thread when I find more options!! :)

graykode added a commit that referenced this issue Sep 20, 2020
@graykode graykode changed the title AssertionError AssertionError(assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype))) Nov 17, 2020
@graykode
Owner

A step-by-step look at why this error occurs:

  1. The dataset on the minio server was updated under the same dataset_name and dataset_additional.
  2. However, the JSON cached locally, that is, the files in the ~/.matorage folder, was not updated.
  3. Currently, the files in the ~/.matorage folder must be deleted manually; logic to handle this automatically still needs to be implemented.
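
The missing logic in step 3 could look roughly like the sketch below: compare the cached JSON with the metadata fetched from the server and rewrite the cache when they differ. This is only an illustration under stated assumptions. `refresh_cache`, `cache_file`, and `remote_metadata` are hypothetical names, and how matorage actually fetches remote metadata or lays out its cache files is not shown in this thread.

```python
import json
from pathlib import Path


def refresh_cache(cache_file: Path, remote_metadata: dict) -> bool:
    """Rewrite cache_file when it is missing or differs from remote_metadata.

    Returns True if the cache was (re)written, False if it was already current.
    """
    if cache_file.exists():
        try:
            cached = json.loads(cache_file.read_text())
        except json.JSONDecodeError:
            cached = None  # treat an unreadable cache as stale
        if cached == remote_metadata:
            return False
    # Stale or missing: replace the cache with the server's view.
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(remote_metadata, sort_keys=True))
    return True
```

Running a check like this before the assertion would replace the manual rm step with an automatic refresh whenever the server-side metadata changes.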
