Training Details: I trained on the DAVIS train-val dataset (90 videos of roughly 80 frames each) for 400k iterations per UNet, 800k iterations in total. I used ElucidatedImagen with Unet3D and no text prompt, trained on two Titan RTX GPUs with 24 GB of memory each. Both UNets have an embedding dimension of 64. The low- and "high"-resolution UNets operate at 64x64 and 128x128, and are trained on 12 and 3 frames, respectively, with a temporal downsampling factor of two for the first UNet. The batch sizes are 4 and 2, respectively.
Results: I am not sure what to think of the outcome. I am happy something happened 😄 but it's not an impressive result. I suspect the small embedding dimension may be the culprit. The final videos look somewhat memorized, the temporal consistency is not very good, and there is seemingly limited diversity. Below are example videos from 200k, 300k, and 400k iterations of training the second UNet:
200k
300k
400k
Checkpoints: I have checkpoint files, but I don't know how to share them. They are fairly large (1.5 GB), and I can't upload them to Google Drive. If someone is interested and has a recommended way of sharing the weights, I can do so.
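For anyone who ends up with the weights: loading them back should only require rebuilding the same architecture and using ImagenTrainer's built-in checkpointing. A sketch under the configuration described above; the checkpoint filename here is hypothetical.

```python
from imagen_pytorch import Unet3D, ElucidatedImagen, ImagenTrainer

# Rebuild the same architecture the checkpoint was trained with.
imagen = ElucidatedImagen(
    unets = (Unet3D(dim = 64), Unet3D(dim = 64)),
    image_sizes = (64, 128),
    condition_on_text = False,
    temporal_downsample_factor = (2, 1),
)
trainer = ImagenTrainer(imagen)

# Uncomment once you actually have the file:
# trainer.load('./davis_checkpoint.pt')  # hypothetical filename; restores weights + optimizer state

# Unconditional sampling from the cascade:
# videos = trainer.sample(batch_size = 1, video_frames = 12)
# videos should come out as (1, 3, 12, 128, 128): batch, channels, frames, height, width
```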