
Training with continue and fork mode terminated due to unhandled system error #1098

Open
drremo1 opened this issue Apr 1, 2023 · 0 comments
Labels
bug Something isn't working

Comments

drremo1 commented Apr 1, 2023

Bug Description

Hello, I've been trying to train the sota/2019 pre-trained dev-clean transformer model for 1 more epoch using flashlight's train continue mode. However, training fails to start because the pre-trained models from wav2letter are not compatible with flashlight. I then installed wav2letter v0.2 to retrain the pre-trained models with train continue, but it fails with this error:

I0401 12:50:50.303958 29472 Train.cpp:80] Parsing command line flags
I0401 12:50:50.303997 29472 Train.cpp:81] Overriding flags should be mutable when using `continue`
terminate called after throwing an instance of 'std::runtime_error'
  what():  unhandled system error
*** Aborted at 1680324650 (unix time) try "date -d @1680324650" if you are using GNU date ***
PC: @     0x7f4669af9e87 gsignal
*** SIGABRT (@0x3e800007320) received by PID 29472 (TID 0x7f4697288380) from PID 29472; stack trace: ***
    @     0x7f468f59f980 (unknown)
    @     0x7f4669af9e87 gsignal
    @     0x7f4669afb7f1 abort
    @     0x7f466a4ee957 (unknown)
    @     0x7f466a4f4ae6 (unknown)
    @     0x7f466a4f4b21 std::terminate()
    @     0x7f466a4f4d54 __cxa_throw
    @     0x55673215c6f8 fl::detail::ncclCheck()
    @     0x55673215ddd7 fl::distributedInit()
    @     0x5567320cb387 w2l::initDistributed()
    @     0x556731e3eab2 main
    @     0x7f4669adcc87 __libc_start_main
    @     0x556731ea7e4a _start
Aborted

I tried using train fork and the error still persists. The error does not occur when using train alone.

Reproduction Steps

This is what I ran:

wav2letter/build/Train continue /mnt/d/198 --minloglevel=0 --logtostderr=1 --rndv_filepath=
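For reference, since the stack trace points at fl::distributedInit() and ncclCheck(), here is a sketch of the same invocation run as a single process with distributed training turned off. This assumes the build exposes wav2letter's enable_distributed flag and that disabling it skips the NCCL init path; I have not verified either against this exact binary:

wav2letter/build/Train continue /mnt/d/198 --minloglevel=0 --logtostderr=1 --enable_distributed=false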

Is there another way to train the pre-trained models for just 1 epoch?

drremo1 added the bug (Something isn't working) label on Apr 1, 2023