馃[question]tensorflow runtime error #5787

Open
YUjunuuuuu opened this issue Jan 20, 2023 · 3 comments

@YUjunuuuuu

Describe your question

I created an experiment on ImageNet using TensorFlow. However, something seems to go wrong in the experiment; the log file is attached below. We implement keras.TFKerasTrial in model_def.py. Training runs for a while, and then the error occurs. Can you give me some suggestions?

experiment_89_trial_83_logs.txt

@rb-determined-ai
Member

Sorry, I meant to answer you last week but got distracted.

Process 10 exit with status code 247

247 would indicate a process died due to being kill -9'd. If you didn't kill it yourself, it was likely killed by the OOM killer. You might need to try a smaller batch size, or perhaps turn on profiling and watch your memory usage.
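
To illustrate why 247 points at signal 9, here is a minimal sketch. It assumes the status was produced by a wrapper exiting with the child's negative return code, which the OS truncates to 8 bits. `decode_exit_status` is a hypothetical helper written for this comment, not part of Determined or TensorFlow.

```python
import signal


def decode_exit_status(status: int) -> str:
    """Hypothetical helper: name the signal that an exit status may encode."""
    # Two common encodings of "killed by signal N":
    #   256 - N: a wrapper exited with the child's return code -N and the
    #            OS kept only the low 8 bits (-9 % 256 == 247).
    #   128 + N: the convention used by POSIX shells (128 + 9 == 137).
    for signum in (256 - status, status - 128):
        try:
            return signal.Signals(signum).name
        except ValueError:
            continue  # not a known signal number under this encoding
    return "not a signal-style status"


print(decode_exit_status(247))  # wrapped -9
print(decode_exit_status(137))  # shell-style 128 + 9
```

Both calls should name SIGKILL on Linux. The mapping is only a heuristic, since an ordinary exit code can collide with either encoding.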

@YUjunuuuuu
Author

OK, I will try it and get back to you later.

@YUjunuuuuu
Author

I ran the experiments on ImageNet and decreased the batch size to 128 (16 × 3090 GPUs, a batch of 8 per GPU); however, it doesn't work. Here are my config files.
[experiment_99_trial_93_logs.txt](https://github.com/determined-ai/det
exp99_distributed_yaml.txt
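
For reference, the batch arithmetic above can be checked with a small sketch. It assumes the global batch is split evenly across GPUs; `per_gpu_batch_size` is a hypothetical helper, and its names are illustrative rather than taken from the attached config.

```python
def per_gpu_batch_size(global_batch_size: int, num_gpus: int) -> int:
    """Hypothetical helper: split a global batch evenly across GPUs."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size is not divisible by the GPU count")
    return global_batch_size // num_gpus


# Values from this comment: global batch of 128 on 16 GPUs.
print(per_gpu_batch_size(128, 16))  # 8
```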
