
A few questions for the usage #1

Open · jinserk opened this issue Aug 10, 2020 · 7 comments
Labels: question (Further information is requested)

jinserk commented Aug 10, 2020

It's really fantastic! Thank you so much for sharing this project.
I ran a quick test against a MinIO Docker container and confirmed it works as expected.
I'd like to ask a few questions about usage:

  • Can I save objects other than plain tensors, e.g. a tuple of tensors, a dict, or a sparse tensor? If so, how do I specify the attributes?
  • If I periodically add new data samples to an existing dataset (so that I have to refresh the whole dataset with the added samples), is it okay to just append them to the dataset and save it?
  • If I use this for distributed training, will each DataLoader hold its own copy of the dataset? For example, with PyTorch DDP each process has its own DataLoader and loads samples during training, so I wonder whether the loaded dataset ends up being a multiple of the original dataset size (one copy per process).
  • It looks like only PyTorch and TensorFlow are supported right now, but what about NumPy arrays or matrices for scikit-learn or XGBoost? Can I store NumPy objects as well?
graykode (Owner) commented Aug 11, 2020

@jinserk Thank you for your interest in the project!

  • Can I save any objects other than tensors, e.g. a tuple of tensors, a dict, or a sparse tensor? How do I specify the attributes in that case? : As you can see in the existing examples, tensors in the form of a tuple or dict can also be stored. We do not define a dedicated operation for sparse tensors, but you can still save them; native sparse tensor support will be added to the long-term plan. Below is an example where the attributes are given as tuples and the data is saved as a dict (a fuller end-to-end sketch follows the snippet):
attributes=[
    ('image', 'float32', (1, 28, 28)),  # (name, dtype, per-sample shape)
    ('target', 'int64', (1,))           # note: (1,) is a proper 1-tuple, (1) is just the int 1
]

traindata_saver({
    'image': image,
    'target': target
})
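For completeness, here is a minimal end-to-end sketch of the same save flow, assuming the DataConfig/DataSaver API shown in the project README; the endpoint and credentials are placeholders for a local MinIO instance:

from matorage import DataConfig, DataSaver
import torch

# Placeholder MinIO endpoint and credentials.
traindata_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='mnist',
    attributes=[
        ('image', 'float32', (1, 28, 28)),
        ('target', 'int64', (1,))
    ]
)

traindata_saver = DataSaver(config=traindata_config)
image = torch.rand(1, 28, 28)                  # torch.Tensor, tf.Tensor, or numpy all work
target = torch.tensor([3], dtype=torch.int64)
traindata_saver({'image': image, 'target': target})
traindata_saver.disconnect()                   # flush and upload any remaining buffers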
  • If I add more data samples to an existing dataset, is it okay to just append them and save? : Simply appending more data does not matter if you save using the existing config. However, refreshing the data is not currently implemented in code. If you want to refresh a dataset (which means the same as removing its buckets), you should use the MinIO web console or MinIO's mc command (mc rb --force --dangerous local/<bucket_name>); a scripted version of this workaround is sketched after the snippet below. I will implement a refresh method by adding a new option to the data saver, like this:
traindata_saver({
    'image': image,
    'target': target
}, refresh=True)  # planned argument, not yet implemented
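Until the refresh option lands, the bucket-removal workaround mentioned above can also be scripted with the MinIO Python SDK; a hedged sketch (endpoint, credentials, and bucket name are placeholders):

from minio import Minio

# Equivalent of `mc rb --force --dangerous local/<bucket_name>`:
# empty the bucket first, then remove it.
client = Minio('127.0.0.1:9000',
               access_key='minio', secret_key='miniosecretkey', secure=False)

bucket = 'bucket-name'  # placeholder for the dataset's bucket
for obj in client.list_objects(bucket, recursive=True):
    client.remove_object(bucket, obj.object_name)
client.remove_bucket(bucket)  # fails unless the bucket is empty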
  • If I use this in distributed training, will each DataLoader hold its own copy of the dataset? : I understood this question as follows: with distributed data parallel (DDP), is a dataset of N samples replicated across M nodes, producing N*M samples in total? Yes, that's right. When a matorage dataset is initialized, it runs logic that pre-downloads the data corresponding to the config (https://github.com/graykode/matorage/blob/master/matorage/data/data.py#L80). However, this is very inefficient, so we recommend the network-attached storage (NAS) option when using DDP. With NAS, no new downloads of data occur, so the N samples are kept intact and DDP training can be performed (see the sketch below).
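For the DDP case, a hedged sketch of how each rank could read only its shard using a standard DistributedSampler (this is plain PyTorch sharding, not a matorage-specific feature; the matorage.torch.Dataset import path is assumed from the README):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from matorage.torch import Dataset  # assumed import path from the README

# Assumes torch.distributed.init_process_group() has already been called.
train_dataset = Dataset(config=traindata_config)  # config as defined above
sampler = DistributedSampler(train_dataset)       # shards indices across ranks
train_loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)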
  • It looks like only PyTorch and TensorFlow are supported now; can I store NumPy objects for scikit-learn or XGBoost as well? : When PyTorch tensors (torch.Tensor) and TensorFlow tensors (tf.Tensor) are saved, they are converted to NumPy and then stored. (As with TFRecord, the tensors could be encoded in the protobuf format, but because of its lack of universality we used the most popular format, NumPy.) Therefore, since the storage format is NumPy arrays, you can save data for scikit-learn or XGBoost as well. Of course, on the data-loader side, new additional implementations would be needed. See how a NumPy array is saved (https://github.com/graykode/matorage/blob/master/tests/test_datasaver.py#L247), plus the sketch below.
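And a hedged sketch of saving raw NumPy arrays directly; this is an assumption based on the conversion behavior described above (see the linked test for the project's own example):

import numpy as np

# Since framework tensors are converted to NumPy before being written,
# raw NumPy arrays should follow the same path (assumption).
image = np.random.rand(1, 28, 28).astype('float32')  # e.g. features for scikit-learn/XGBoost
target = np.array([3], dtype='int64')
traindata_saver({'image': image, 'target': target})  # same saver as above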

Best regards, Tae Hwan

graykode pinned this issue Aug 11, 2020
jinserk (Author) commented Aug 11, 2020

@graykode Thank you very much for the detailed answers! They are really helpful, and very impressive.

My first question was actually about tensors with heterogeneous shapes, i.e. the case where the image size in your MNIST example varies from sample to sample. In practice I am working on a chemistry problem (molecule classification for chemistry and pharma companies), where the input features are graphs whose sizes vary with the molecule. I know this cannot simply be expressed with attributes, which is why I asked about sparse matrix support. I hope this can be implemented and usable soon! :)

graykode (Owner) commented Aug 11, 2020

@jinserk
Matrices with atypical shapes are difficult to store regardless of sparsity. Sparse itself is not hard to implement, since it is already supported through scipy (https://github.com/appier/h5sparse). However, it is very difficult to store a tensor with an undefined shape in HDF5.

I have a question: to feed a PyTorch model, all input shapes must be the same, so I am curious how a tensor of heterogeneous shape can be an input to a model.

jinserk (Author) commented Aug 11, 2020

Good question. Basically I use a fixed shape for the model's input. During training, I just pad the heterogeneous inputs up to a fixed shape with the maximum dimension values, roughly as in the sketch below. I ran a quick test storing my whole dataset as padded dense matrices, and the stored file was almost 100 times bigger, which is totally impractical.
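For concreteness, a minimal sketch of the padding I mean (pad_adjacency is a hypothetical helper, not project code):

import numpy as np

def pad_adjacency(adj: np.ndarray, max_n: int) -> np.ndarray:
    """Zero-pad a variable-size (n x n) adjacency matrix to (max_n x max_n)."""
    n = adj.shape[0]
    out = np.zeros((max_n, max_n), dtype=adj.dtype)
    out[:n, :n] = adj
    return out

small = np.ones((3, 3), dtype='float32')  # e.g. a 3-atom molecule
fixed = pad_adjacency(small, max_n=128)   # dense and mostly zeros, hence the blow-up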

graykode (Owner) commented Aug 12, 2020

If so, how about storing in matorage the fixed tensor itself that goes into the model's input? That is the core idea of matorage.
Also, using a high compression level (7-9) can help a sparse matrix store much more compactly; see the sketch below.
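For example, a minimal h5py sketch of what I mean; the file and dataset names are placeholders:

import h5py
import numpy as np

# A mostly-zero (sparse-like) matrix compresses extremely well with gzip.
data = np.zeros((4096, 4096), dtype='float32')
data[:16, :16] = 1.0  # a small dense block; the rest stays zero

with h5py.File('padded_demo.h5', 'w') as f:
    f.create_dataset('x', data=data,
                     compression='gzip', compression_opts=9,  # gzip level 9 (max)
                     chunks=True)  # chunked layout is required for compression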

jinserk (Author) commented Aug 12, 2020

Thanks for the suggestion, @graykode! I will try that, since I don't know yet how well the compression will work. I once tried storing the fixed (padded) tensors and the serialized file was almost 400 GB (using torch.save), while the file with sparse tensors was only 4 GB. I still hope storing sparse tensors will be supported in this matorage project soon. :)

graykode (Owner) commented Aug 12, 2020

@jinserk

I understand that HDF5 currently has no official support for sparse matrices. (This is not impossible to implement; in fact there are implementations such as https://github.com/appier/h5sparse.) For that reason, the official PyTables documentation also recommends compression for sparse matrices.

In fact, many sources recommend using compression with the HDF5 format (https://stackoverflow.com/a/25678471/5350490). According to that answer, 512 MB of sparse data can be compressed down to about 4.5 KB. So, could you experiment with your 400 GB data and report the final compressed size? Please try compression='gzip' with level=9 and let me know how small it gets! A sketch using h5sparse follows below.

In addition, apart from this, we will add a mechanism for sparse matrices to our long-term plans!!
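A hedged sketch of storing a SciPy CSR matrix with the h5sparse library linked above; the file and dataset names are placeholders:

import numpy as np
import scipy.sparse as ss
import h5sparse

# Only the nonzero structure (data/indices/indptr) is written, not the dense matrix.
m = ss.random(10000, 10000, density=0.0001, format='csr', dtype=np.float32)

with h5sparse.File('sparse.h5', 'w') as h5f:
    h5f.create_dataset('matrix', data=m)

with h5sparse.File('sparse.h5', 'r') as h5f:
    restored = h5f['matrix'][:]  # slicing returns a scipy.sparse matrix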

graykode added the question label Aug 28, 2020