[WIP] SummaryReader #577

Open
wants to merge 2 commits into
base: master

dsuess commented Apr 28, 2020

The missing support for reading tensorboard files was raised in #318. This PR adds support for iterating over tensorboard files. It's currently work-in-progress and I want to use this PR to discuss further development.

Currently, SummaryReader reads a single tfevents file and yields the parsed Event protobuf objects, similar to the summary_iterator function from tensorflow.python.summary.summary_iterator. Under the hood, I use a refactored version of PyRecordReader_New from tensorboard.compat.tensorflow_stub.pywrap_tensorflow to iterate over the records; SummaryReader itself only parses the protobuf Events.
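To illustrate the record iteration described above, here is a minimal sketch of reading the TFRecord framing that tfevents files use (a little-endian uint64 length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload). It is not the code from this PR: `iter_records` and `frame_record` are hypothetical names, and CRC verification is skipped for brevity.

```python
import io
import struct
from typing import BinaryIO, Iterator


def iter_records(stream: BinaryIO) -> Iterator[bytes]:
    """Yield raw record payloads from a TFRecord-framed binary stream.

    Assumption: the CRC fields are read but not verified in this sketch.
    """
    while True:
        # Header: uint64 payload length + uint32 masked CRC of the length.
        header = stream.read(12)
        if len(header) == 0:
            return  # clean end of file
        if len(header) < 12:
            raise EOFError("truncated record header")
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        footer = stream.read(4)  # uint32 masked CRC of the payload
        if len(payload) < length or len(footer) < 4:
            raise EOFError("truncated record")
        yield payload


def frame_record(payload: bytes) -> bytes:
    """Frame a payload for demo purposes only.

    The CRC fields are zeroed, so real readers that verify checksums
    would reject this output.
    """
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


if __name__ == "__main__":
    demo = io.BytesIO(frame_record(b"first") + frame_record(b"second"))
    for record in iter_records(demo):
        print(len(record), record)
```

Each payload would then be parsed into an Event message, for example with `Event.FromString(payload)` from an `event_pb2` module such as the one bundled with tensorboardX (assumed here, not shown).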

How should we continue from here? One thing I wasn't sure about is whether we want to keep the current interface or convert the Event objects into more Pythonic objects, e.g. dicts.

lanpa (Owner) commented May 4, 2020

Thanks for your contribution! Three issues come to my mind:

  1. Whether to dump sequentially or support random access:
  • A typical file size would be 1 GB. Since the global_step is saved in the event proto, we need to decode each event before finding the event with a specific global_step. So an additional data structure is needed for fast random access.
  2. What format the extracted data should have:
  • The image is saved in encoded format, and TensorboardX supports the GIF format. It is trivial to save them as files (with the magic bytes?), but a better target would be a NumPy array (because it's a reader). How about the histogram or the audio plugin?
  3. Usefulness of dumping different data types (scalar, image, ...):
  • Scalars can be downloaded as a JSON or CSV file from the TensorBoard web page. Images can be downloaded as well. But with a reader, users can get the scalar values without handling JSON or CSV files, which is a merit.
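The "additional data structure" from point 1 could be as simple as a map from global_step to byte offsets, built in one sequential pass. The sketch below is only an illustration under assumptions: `build_step_index`, `read_record_at`, and the `step_of` callable are hypothetical names, and CRC checks are skipped. In real use, `step_of` would parse the payload as an Event proto and return its step.

```python
import io
import struct
from typing import BinaryIO, Callable, Dict, List


def build_step_index(
    stream: BinaryIO, step_of: Callable[[bytes], int]
) -> Dict[int, List[int]]:
    """Scan the file once and map each step to the offsets of its records."""
    index: Dict[int, List[int]] = {}
    while True:
        offset = stream.tell()
        header = stream.read(12)  # uint64 length + uint32 CRC of the length
        if len(header) < 12:
            return index  # end of file (or truncated tail)
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        if len(payload) < length:
            return index  # truncated record, stop indexing
        stream.seek(4, io.SEEK_CUR)  # skip the payload CRC
        index.setdefault(step_of(payload), []).append(offset)


def read_record_at(stream: BinaryIO, offset: int) -> bytes:
    """Random access: jump to a stored offset and return that payload."""
    stream.seek(offset)
    (length,) = struct.unpack("<Q", stream.read(8))
    stream.seek(4, io.SEEK_CUR)  # skip the length CRC
    return stream.read(length)
```

The index costs one full decode pass up front, which is the trade-off behind a `build_index` flag: pay once when many lookups follow, or skip it for a single sequential dump.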

Personally, I want to save all images of an experiment (maybe too much data).
So I would like to pass an additional tag parameter to SummaryReader to filter the images I want. Here's how I would design the interface (image plugin only):

reader = SummaryReader(filename, build_index=True)
encoded_images = reader.read_image('my_tag')
encoded_image = reader.read_image('my_tag', global_step=5)
encoded_image = reader.read_image('my_tag', global_step=7)
image = reader.read_image_as_numpy('my_tag', global_step=8)

reader = SummaryReader(filename, build_index=False)
filenames = reader.read_image('my_tag', dump=True)
filenames = reader.dump_images('my_tag')  # I think it's better
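For the dump path, the file extension could be chosen from the magic bytes of the encoded image, as hinted at in point 2. This is a small sketch, not part of the proposed interface: `guess_image_extension` is a hypothetical helper, and it only covers PNG, GIF, and JPEG.

```python
from typing import Optional

# Well-known file signatures for the formats considered here.
_MAGIC = (
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"\xff\xd8\xff", ".jpg"),
)


def guess_image_extension(encoded: bytes) -> Optional[str]:
    """Return a file extension based on the leading bytes, or None if unknown."""
    for magic, extension in _MAGIC:
        if encoded.startswith(magic):
            return extension
    return None
```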

What are your use cases?

dsuess (Author) commented May 5, 2020

Thanks for your feedback. My main goal so far was to replace the summary_iterator from TensorFlow, which does what the current implementation does. We use it mainly for parsing the results from a tfevents file into a DataFrame for further processing or visualization.
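The DataFrame use case could look roughly like the sketch below, which flattens scalar summaries into one row per value. The `Fake*` dataclasses are stand-ins that mirror the Event proto fields used here (`step`, `wall_time`, `summary.value[].tag`, `simple_value`) so the example runs without protobuf. `scalar_rows` is a hypothetical helper, and it assumes scalars are stored in `simple_value`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class FakeValue:
    tag: str
    simple_value: Optional[float] = None  # None means "not a scalar" here


@dataclass
class FakeSummary:
    value: List[FakeValue] = field(default_factory=list)


@dataclass
class FakeEvent:
    step: int
    wall_time: float
    summary: FakeSummary = field(default_factory=FakeSummary)


def _has_scalar(value: Any) -> bool:
    # Real protobuf messages expose HasField; the stand-ins use None instead.
    has_field = getattr(value, "HasField", None)
    if callable(has_field):
        return bool(has_field("simple_value"))
    return value.simple_value is not None


def scalar_rows(events: Iterable[Any]) -> List[Dict[str, Any]]:
    """Flatten scalar summaries into one dict per (event, value) pair."""
    rows: List[Dict[str, Any]] = []
    for event in events:
        for value in event.summary.value:
            if _has_scalar(value):
                rows.append(
                    {
                        "step": event.step,
                        "wall_time": event.wall_time,
                        "tag": value.tag,
                        "value": value.simple_value,
                    }
                )
    return rows
```

The resulting list can be passed directly to `pandas.DataFrame(rows)`.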

Regarding your questions:

  1. Sequential reading is definitely easier; I'd have to read the TF source code to see how they deal with random access.

  2. Agreed, it would be nice if we don't just return the raw protobuf objects, but convert them into something more Pythonic. I think it's easy for scalars and lists. Histograms could be converted to TF's histogram data structure represented by two NumPy arrays, which would be cheap too. For images, I would check if we can handle the image decoding lazily, e.g. through Pillow.
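Lazy image decoding could be sketched as a thin wrapper that keeps the encoded bytes and decodes only on first access. `LazyImage` is a hypothetical class, and the decoder is injected so the sketch does not depend on a specific imaging library.

```python
from typing import Any, Callable


class LazyImage:
    """Hold encoded image bytes and decode them at most once, on demand."""

    def __init__(self, encoded: bytes, decoder: Callable[[bytes], Any]) -> None:
        self.encoded = encoded
        self._decoder = decoder
        self._decoded: Any = None
        self._is_decoded = False

    @property
    def decoded(self) -> Any:
        # Decode on first access and cache the result for later calls.
        if not self._is_decoded:
            self._decoded = self._decoder(self.encoded)
            self._is_decoded = True
        return self._decoded
```

With Pillow installed, the decoder could be `lambda data: Image.open(io.BytesIO(data))`, so iterating over a large file never pays the decoding cost for images that are skipped.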

kaiwenw commented Jul 31, 2020

Hey @dsuess, thanks for the PR! Our team is also very interested in this feature. Are you still working on it, and is there an ETA?

dsuess (Author) commented Jul 31, 2020

Hi @kaiwenw, I'd love to keep working on this. What are your use cases? Currently, it's a bit rough and limited, but it does what I need.

kaiwenw commented Jul 31, 2020

Hi @dsuess, we usually need to retrieve the end of the log, mostly for debugging purposes. For example, we have hard cutoffs in integration tests, and it would be nice to retrieve the end of the log programmatically in a notebook as well.

As for data types, we probably just need lists of scalars, histograms, and maybe embeddings (no images or audio needed).
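With a purely sequential reader, retrieving the end of the log can be sketched with a bounded deque, which keeps only the last n items while the iterator is consumed. `tail` is a hypothetical helper name; it works on any iterable of events.

```python
from collections import deque
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def tail(events: Iterable[T], n: int) -> List[T]:
    """Return the last n items of an iterable using O(n) memory."""
    return list(deque(events, maxlen=n))
```

This still reads the whole file once, but memory stays bounded regardless of the log size.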
